Preface



Importance of embedded systems 

Embedded systems can be defined as information processing systems embed- 
ded into enclosing products such as cars, telecommunication or fabrication 
equipment. Such systems come with a large number of common character- 
istics, including real-time constraints, and dependability as well as efficiency 
requirements. Embedded system technology is essential for providing ubiq- 
uitous information, one of the key goals of modern information technology 
(IT). 

Following the success of IT for office and workflow applications, embedded
systems are considered to be the most important application area of informa- 
tion technology during the coming years. Due to this expectation, the term 
post-PC era was coined. This term denotes the fact that in the future, standard- 
PCs will be a less dominant kind of hardware. Processors and software will be 
used in much smaller systems and will in many cases even be invisible (this 
led to the term the disappearing computer). It is obvious that many technical 
products have to be technologically advanced to find customers’ interest. Cars, 
cameras, TV sets, mobile phones etc. can hardly be sold any more unless they 
come with smart software. The number of processors in embedded systems 
already exceeds the number of processors in PCs, and this trend is expected 
to continue. According to forecasts, the size of embedded software will also 
increase at a large rate. Another kind of Moore’s law was predicted: For many 
products in the area of consumer electronics the amount of code is doubling 
every two years [Vaandrager, 1998]. 

This importance of embedded systems is so far not well reflected in many of the 
current curricula. This book is intended as an aid for changing this situation. It 
provides the material for a first course on embedded systems, but can also be 
used by non-student readers. 
Audience for this book

This book is intended for the following audience:

■ Computer science, computer engineering and electrical engineering stu- 
dents who would like to specialize in embedded systems. The book should 
be appropriate for third year students who do have a basic knowledge of 
computer hardware and software. This book is intended to pave the way 
for more advanced topics that should be covered in a follow-up course. 

■ Engineers who have so far worked on systems hardware and who have to 
move more towards software of embedded systems. This book should pro- 
vide enough background to understand the relevant technical publications. 

■ Professors designing a new curriculum for embedded systems. 

Curriculum integration of embedded systems 

The book assumes a basic understanding in the following areas (see fig. 0.1): 

■ electrical networks at the high-school level (e.g. Kirchhoff’s laws), 

■ operational amplifiers (optional), 

■ computer hardware, for example at the level of the introductory book by 
J.L. Hennessy and D.A. Patterson [Hennessy and Patterson, 1995], 

■ fundamental digital circuits such as gates and registers, 

■ computer programming, 

■ finite state machines, 

■ fundamental mathematical concepts such as tuples, integrals, and linear 
equations, 

■ algorithms (graph algorithms and optimization algorithms such as branch 
and bound), 

■ the concept of NP-completeness. 

Figure 0.1. Positioning of the topics of this book

A key goal of this book is to provide an overview of embedded system design
and to relate the most important topics in embedded system design to each
other. It should help to motivate students and teachers to look at more details.
While the book covers a number of topics in detail, others are covered only
briefly. These brief sections have been included in order to put a number of
related issues into perspective. Furthermore, this approach allows lecturers to
have appropriate links in the book for adding complementary material of their 
choice. The book should be complemented by follow-up courses providing a 
more specialized knowledge in some of the following areas: 

■ digital signal processing, 

■ robotics, 

■ machine vision, 

■ sensors and actuators,

■ real-time systems, real-time operating systems, and scheduling, 

■ control systems, 

■ specification languages for embedded systems, 

■ computer-aided design tools for application-specific hardware, 

■ formal verification of hardware systems, 

■ testing of hardware and software systems, 

■ performance evaluation of computer systems, 

■ low-power design techniques, 

■ security and dependability of computer systems, 

■ ubiquitous computing, 

■ application areas such as telecom, automotive, medical equipment, and 
smart homes. 
■ impact of embedded systems.

A course using this book should be complemented by an exciting lab, using, for
example, small robots, such as Lego Mindstorm™ or similar robots. Another
option is to let students gain some practical experience with StateCharts-based 
tools. 

Additional information related to the book can be obtained from the fol- 
lowing web page: 

http://ls12-www.cs.uni-dortmund.de/~marwedel/kluwer-es-book.

This page includes links to slides, exercises, hints for running labs, references 
to selected recent publications and error corrections. Readers who discover er- 
rors or who would like to make comments on how to improve the book should 
send an e-mail to peter.marwedel@udo.edu. 

Assignments could also use the information in complementary books [Ganssle, 
1992], [Ball, 1996], [Ball, 1998], [Barr, 1999], [Ganssle, 2000], [Wolf, 2001], 
[Buttazzo, 2002]. 

The use of names in this book without any reference to copyrights or trademark 
rights does not imply that these names are not protected by these. 

Please enjoy reading the book! 

Dortmund (Germany), September 2003 
P. Marwedel 



Welcome to the current updated version of this book! The merger of Kluwer 
and Springer publishers makes it possible to publish this version of the book 
less than two years after the initial 2003 version. In the current version, all ty- 
pos and errors found in the original version have been corrected. Moreover, all 
Internet references have been checked and updated. Apart from these changes, 
the content of the book has not been modified. A list of the errors corrected is 
available at the web page listed above. 

Please enjoy reading this updated book. 

Dortmund (Germany), August 2005 
P. Marwedel 
Chapter 1



INTRODUCTION 



1.1 Terms and scope 

Until the late eighties, information processing was associated with large mainframe
computers and huge tape drives. During the nineties, this shifted towards
information processing being associated with personal computers, PCs. The
trend towards miniaturization continues and the majority of information processing
devices will be small portable computers integrated into larger products.
Their presence in these larger products, such as telecommunication equipment,
will be less obvious than for the PC. Hence, the new trend has also been
called the disappearing computer. However, with this new trend, the computer
will actually not disappear; it will be everywhere. This new type of
information technology applications has also been called ubiquitous computing
[Weiser, 2003], pervasive computing [Hansmann, 2001], [Burkhardt,
2001], and ambient intelligence [Koninklijke Philips Electronics N.V., 2003],
[Marzano and Aarts, 2003]. These three terms focus on only slightly different
aspects of future information technology. Ubiquitous computing focuses on the
long term goal of providing “information anytime, anywhere”, whereas pervasive
computing focuses somewhat more on practical aspects and the exploitation
of already available technology. For ambient intelligence, there is some
emphasis on communication technology in future homes and smart buildings.
Embedded systems are one of the origins of these three areas and they provide
a major part of the necessary technology. Embedded systems are information
processing systems that are embedded into a larger product and that
are normally not directly visible to the user. Examples of embedded systems
include information processing systems in telecommunication equipment, in
transportation systems, in fabrication equipment and in consumer electronics.
Common characteristics of these systems are the following:
■ Frequently, embedded systems are connected to the physical environment
through sensors collecting information about that environment and actuators¹
controlling that environment.

■ Embedded systems have to be dependable. 

Many embedded systems are safety-critical and therefore have to be de- 
pendable. Nuclear power plants are an example of extremely safety-critical 
systems that are at least partially controlled by software. Dependability 
is, however, also important in other systems, such as cars, trains, airplanes 
etc. A key reason for being safety-critical is that these systems are directly 
connected to the environment and have an immediate impact on the envi- 
ronment. 

Dependability encompasses the following aspects of a system: 

1 Reliability: Reliability is the probability that a system will not fail. 

2 Maintainability: Maintainability is the probability that a failing sys- 
tem can be repaired within a certain time-frame. 

3 Availability: Availability is the probability that the system is available. 
Both the reliability and the maintainability must be high in order to 
achieve a high availability. 

4 Safety: This term describes the property that a failing system will not 
cause any harm. 

5 Security: This term describes the property that confidential data re- 
mains confidential and that authentic communication is guaranteed. 

■ Embedded systems have to be efficient. The following metrics can be used
for evaluating the efficiency of embedded systems:

1 Energy: Many embedded systems are mobile systems obtaining their
energy through batteries. According to forecasts [SEMATECH, 2003],
battery technology will improve only at a very slow rate. However,
computational requirements are increasing at a rapid rate (especially for
multimedia applications) and customers are expecting long run-times
from their batteries. Therefore, the available electrical energy must be
used very efficiently.

2 Code-size: All the code to be run on an embedded system has to be
stored with the system. Typically, there are no hard discs on which
code can be stored. Dynamically adding additional code is still an exception
and limited to cases such as Java-phones and set-top boxes.



¹In this context, actuators are devices converting numerical values into physical effects.
Due to all the other constraints, this means that the code-size should
be as small as possible for the intended application. This is especially 
true for systems on a chip (SoCs), systems for which all the informa- 
tion processing circuits are included on a single chip. If the instruction 
memory is to be integrated onto this chip, it should be used very effi- 
ciently. 

3 Run-time efficiency: The minimum amount of resources should be 
used for implementing the required functionality. We should be able to 
meet time constraints using the least amount of hardware resources and 
energy. In order to reduce the energy consumption, clock frequencies 
and supply voltages should be as small as possible. Also, only the 
necessary hardware components should be present. Components which 
do not improve the worst case execution time (such as many caches or 
memory management units) can be omitted. 

4 Weight: All portable systems must be of low weight. Low weight is 
frequently an important argument for buying a certain system. 

5 Cost: For high-volume embedded systems, especially in consumer 
electronics, competitiveness on the market is an extremely crucial is- 
sue, and efficient use of hardware components and the software devel- 
opment budget are required. 

■ These systems are dedicated towards a certain application. 

For example, processors running control software in a car or a train will 
always run that software, and there will be no attempt to run a computer 
game or spreadsheet program on the same processor. There are mainly two 
reasons for this: 

1 Running additional programs would make those systems less depend- 
able. 

2 Running additional programs is only feasible if resources such as mem- 
ory are unused. No unused resources should be present in an efficient 
system. 

■ Most embedded systems do not use keyboards, mice and large computer 
monitors for their user-interface. Instead, there is a dedicated user-inter- 
face consisting of push-buttons, steering wheels, pedals etc. Because of 
this, the user hardly recognizes that information processing is involved. 
Due to this, the new era of computing has also been characterized by the 
disappearing computer. 

■ Many embedded systems must meet real-time constraints. Not complet- 
ing computations within a given time-frame can result in a serious loss of 
the quality provided by the system (for example, if the audio or video quality
is affected) or may cause harm to the user (for example, if cars, trains
or planes do not operate in the predicted way). A time-constraint is called 
hard if not meeting that constraint could result in a catastrophe [Kopetz, 
1997]. All other time constraints are called soft. 

Many of today’s information processing systems are using techniques for 
speeding-up information processing on the average. For example, caches 
improve the average performance of a system. In other cases, reliable com- 
munication is achieved by repeating certain transmissions. For example, 
Internet protocols typically rely on resending messages in case the original 
messages have been lost. On the average, such repetitions result in a (hope- 
fully only) small loss of performance, even though for a certain message 
the communication delay can be orders of magnitude larger than the nor- 
mal delay. In the context of real-time systems, arguments about the average 
performance or delay cannot be accepted. A guaranteed system response 
has to be explained without statistical arguments [Kopetz, 1997]. 

■ Many embedded systems are hybrid systems in the sense that they include 
analog and digital parts. Analog parts use continuous signal values in con- 
tinuous time, whereas digital parts use discrete signal values in discrete 
time. 

■ Typically, embedded systems are reactive systems. They can be defined 
as follows: A reactive system is one that is in continual interaction with its 
environment and executes at a pace determined by that environment [Berge
et al., 1995]. Reactive systems can be thought of as being in a certain state,
waiting for an input. For each input, they perform some computation and 
generate an output and a new state. Therefore, automata are very good mod- 
els of such systems. Mathematical functions, which describe the problems 
solved by most algorithms, would be an inappropriate model. 

■ Embedded systems are under-represented in teaching and in public discussions.
“Embedded chips aren’t hyped in TV and magazine ads ...” [Ryan,
1995]. One of the problems in teaching embedded system design is the
equipment which is needed to make the topic interesting and practical. 
Also, real embedded systems are very complex and hence difficult to teach. 
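
The view of a reactive system as an automaton can be made concrete with a small sketch. The following Python fragment is only an illustration: the door-controller scenario, the state names, and the inputs and outputs are hypothetical and do not come from this book. For each input, the step function produces an output and a new state, which is the behavior described for reactive systems above.

```python
# Hypothetical reactive system: a door controller modeled as an automaton.
# State names, inputs and outputs are assumptions chosen for illustration only.

# (state, input) -> (output, next state)
TRANSITIONS = {
    ("closed", "open_request"): ("start_motor_open", "opening"),
    ("opening", "limit_reached"): ("stop_motor", "open"),
    ("open", "close_request"): ("start_motor_close", "closing"),
    ("closing", "limit_reached"): ("stop_motor", "closed"),
    ("closing", "obstacle"): ("start_motor_open", "opening"),
}


def step(state, event):
    """Consume one input and return (output, new_state).

    Inputs that are not defined for the current state are ignored:
    no output is produced and the state is unchanged.
    """
    return TRANSITIONS.get((state, event), (None, state))


def run(initial_state, events):
    """Feed a sequence of inputs to the automaton, collecting the outputs."""
    state, outputs = initial_state, []
    for event in events:
        output, state = step(state, event)
        outputs.append(output)
    return state, outputs
```

The automaton does nothing until the next input arrives, so its pace is set entirely by the environment.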

Due to this set of common characteristics (except for the last one), it does make 
sense to analyze common approaches for designing embedded systems, instead 
of looking at the different application areas only in isolation. 

Actually, not every embedded system will have all the above characteristics. 
We can define the term “embedded system” also in the following way: Infor- 
mation processing systems meeting most of the characteristics listed above are 
called embedded systems. This definition includes some fuzziness. However,
it seems to be neither necessary nor possible to remove this fuzziness. 
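
The argument about average versus guaranteed behavior, made for real-time constraints in the list of characteristics above, can be illustrated numerically. The sketch below assumes a hypothetical channel in which each transmission attempt takes a fixed time and a message may need several attempts; all numbers and function names are invented for illustration.

```python
# Hypothetical numbers: each attempt takes 2 ms; most messages need one
# attempt, but a few need many retransmissions.
ATTEMPT_TIME_MS = 2.0


def delays_ms(attempts_per_message):
    """Delay of each message if every attempt takes ATTEMPT_TIME_MS."""
    return [n * ATTEMPT_TIME_MS for n in attempts_per_message]


def average_delay_ms(delays):
    """Mean delay over all messages."""
    return sum(delays) / len(delays)


def meets_hard_deadline(delays, deadline_ms):
    """A hard deadline must hold for every message, not just on average."""
    return max(delays) <= deadline_ms


# 99 messages succeed at once, one message needs 50 attempts.
observed = delays_ms([1] * 99 + [50])
print(average_delay_ms(observed))           # the average looks harmless
print(max(observed))                        # the worst observed delay does not
print(meets_hard_deadline(observed, 10.0))  # deadline check uses the worst case
```

Note that even the maximum over observed delays is only a measurement. A guarantee would require a bound derived from the design itself, for example a limit on the number of retransmissions.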

Most of the characteristics of embedded systems can also be found in a re- 
cently introduced type of computing: ubiquitous or pervasive computing, also 
called ambient intelligence. The key goal of this type of computing is to make 
information available anytime, anywhere. It does therefore comprise commu- 
nication technology. Fig. 1.1 shows a graphical representation of how ubiq- 
uitous computing is influenced by embedded systems and by communication 
technology. 



For example, ubiquitous computing has to meet real-time and dependability re- 
quirements of embedded systems while using fundamental techniques of com- 
munication technology, such as networking. 
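
The dependability aspects introduced earlier in this section can be related quantitatively. The sketch below is not taken from this book; it relies on two common textbook assumptions, namely a constant failure rate (so that reliability decays exponentially over time) and the steady-state formula availability = MTTF / (MTTF + MTTR), where MTTF is the mean time to failure and MTTR the mean time to repair. All numbers are made up.

```python
import math


def reliability(t_hours, failure_rate_per_hour):
    """Probability of surviving until time t, assuming a constant failure rate."""
    return math.exp(-failure_rate_per_hour * t_hours)


def availability(mttf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is operational."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical system: fails on average every 10,000 h, repair takes 10 h.
mttf, mttr = 10_000.0, 10.0
print(reliability(1_000.0, 1.0 / mttf))  # chance of surviving 1,000 h
print(availability(mttf, mttr))          # long MTTF and short MTTR: close to 1
print(availability(mttf, 1_000.0))       # slow repair lowers availability
```

Under these assumptions, the formula reflects the statement that both reliability and maintainability must be high in order to achieve a high availability.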

1.2 Application areas 

The following list comprises key areas in which embedded systems are used: 

■ Automotive electronics: Modern cars can be sold only if they contain a 
significant amount of electronics. These include air bag control systems, 
engine control systems, anti-lock braking systems (ABS), air-conditioning, GPS
systems, safety features, and many more.

■ Aircraft electronics: A significant amount of the total value of airplanes 
is due to the information processing equipment, including flight control 
systems, anti-collision systems, pilot information systems, and others. De- 
pendability is of utmost importance. 




Figure 1.1. Influence of embedded systems on ubiquitous computing 
■ Trains: For trains, the situation is similar to the one discussed for cars and
airplanes. Again, safety features contribute significantly to the total value
of trains, and dependability is extremely important. 

■ Telecommunication: Mobile phones have been one of the fastest grow- 
ing markets in the recent years. For mobile phones, radio frequency (RF) 
design, digital signal processing and low power design are key aspects. 

■ Medical systems: There is a huge potential for improving the medical 
service by taking advantage of information processing taking place within 
medical equipment. 

■ Military applications: Information processing has been used in military 
equipment for many years. In fact, some of the very first computers ana- 
lyzed military radar signals. 

■ Authentication systems: Embedded systems can be used for authentica- 
tion purposes. 

For example, advanced payment systems can provide more security than 
classical systems. The SMARTpen® [IMEC, 1997] is an example of such
an advanced payment system (see fig. 1.2). 




The SMARTpen is a pen-like instrument analyzing physical parameters
while its user is signing. Physical parameters include the tilt, force and
acceleration. These values are transmitted to a host PC and compared with
information available about the user. As a result, it can be checked if both
the image of the signature as well as the way it has been produced coincide
with the stored information.

Other authentication systems include finger print sensors or face recognition
systems.

■ Consumer electronics: Video and audio equipment is a very important 
sector of the electronics industry. The information processing integrated 
into such equipment is steadily growing. New services and better qual- 
ity are implemented using advanced digital signal processing techniques. 
Many TV sets, multimedia phones, and game consoles comprise high-performance
processors and memory systems. They represent special cases
of embedded systems. 

■ Fabrication equipment: Fabrication equipment is a very traditional area in 
which embedded systems have been employed for decades. Safety is very 
important for such systems, whereas energy consumption is less of a problem. As
an example, fig. 1.3 (taken from Kopetz [Kopetz, 1997]) shows a container 
connected to a pipe. The pipe includes a valve and a sensor. Using the 
readout from the sensor, a computer may have to control the amount of 
liquid leaving the pipe. 




Figure 1.3. Controlling a valve 

The valve is an example of an actuator (see definition on page 2). 

■ Smart buildings: Information processing can be used to increase the com- 
fort level in buildings, can reduce the energy consumption within buildings, 
and can improve safety and security. Subsystems which traditionally were 
unrelated have to be connected for this purpose. There is a trend towards 
integrating air-conditioning, lighting, access control, accounting and dis- 
tribution of information into a single system. For example, energy can be 
saved on cooling, heating and lighting of rooms which are empty. Available 
rooms can be displayed at appropriate places, simplifying ad-hoc meetings 
and cleaning. Air-conditioning noise can be reduced to a level required for the
actual operating conditions. Intelligent usage of blinds can optimize light- 
ing and air-conditioning. Tolerance levels of air conditioning subsystems 
can be increased for empty rooms, and the lighting can be automatically 
reduced. Lists of non-empty rooms can be displayed at the entrance of 
the building in emergency situations (provided the required power is still 
available). 

Initially, such systems will mostly be present only in high-tech office build- 
ings. 

■ Robotics: Robotics is also a traditional area in which embedded systems
have been used. Mechanical aspects are very important for robots. Most of
the characteristics described above also apply to robotics. Recently, some
new kinds of robots, modeled after animals or human beings, have been
designed. Fig. 1.4 shows such a robot.




Figure 1.4. Robot “Johnnie” (courtesy H. Ulbrich, F. Pfeiffer, Lehrstuhl für Angewandte
Mechanik, TU München), © TU München



This set of examples demonstrates the huge variety of embedded systems. Why 
does it make sense to consider all these types of embedded systems in one 
book? It makes sense because information processing in these systems has 
many common characteristics, despite being physically so different. 
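
The valve example of fig. 1.3 follows a sense, compute, actuate pattern that recurs in many of the application areas above. The following sketch is purely illustrative: the proportional control law, the gain, and the function names are assumptions, and real sensor and valve drivers are replaced by plain Python values.

```python
def valve_command(measured_flow, target_flow, gain=0.5, current_opening=0.0):
    """One control step: return the new valve opening in the range [0, 1].

    A simple proportional correction is assumed here; the book does not
    prescribe a particular control law.
    """
    error = target_flow - measured_flow
    new_opening = current_opening + gain * error
    return min(1.0, max(0.0, new_opening))  # a valve cannot exceed its limits


def control_loop(sensor_readings, target_flow):
    """Process a sequence of sensor readouts, yielding one valve opening each."""
    opening = 0.0
    for reading in sensor_readings:
        opening = valve_command(reading, target_flow, current_opening=opening)
        yield opening


# Hypothetical readouts (arbitrary units) around a target flow of 1.0.
print(list(control_loop([0.0, 0.5, 1.0, 1.5], target_flow=1.0)))
```

In a real system, the loop would run periodically and each iteration would have to complete within its time constraint.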

1.3 Growing importance of embedded systems 

The size of the embedded system market can be analyzed from a variety of 
perspectives. Looking at the number of processors that are currently used, it 
has been estimated that about 79% of all the processors are used in embedded 
systems². Many of the embedded processors are 8-bit processors, but despite
this, 75% of all 32-bit processors are integrated into embedded systems [Stiller, 
2000]. Already in 1996, it was estimated that the average American came into 
contact with 60 microprocessors per day [Camposano and Wolf, 1996]. Some 



²Source: Electronic Design.
high-end cars contain more than 100 processors³. These numbers are much
larger than what is typically expected, since most people do not realize that 
they are using processors. The importance of embedded systems was also 
stated by journalist Mary Ryan [Ryan, 1995]: 

... embedded chips form the backbone of the electronics driven world in which 
we live. ... they are part of almost everything that runs on electricity. 

According to quite a number of forecasts, the embedded system market will 
soon be much larger than the market for PC-like systems. Also, the amount 
of software used in embedded systems is expected to increase. According to 
Vaandrager, for many products in the area of consumer electronics the amount
of code is doubling every two years [Vaandrager, 1998]. 
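
To get a feeling for what such a doubling rate implies, the growth can be written as size(t) = size(0) · 2^(t/2), with t in years. The short sketch below evaluates this formula; the starting size of 1 MB is a hypothetical value, not a figure from the cited source.

```python
def projected_code_size(initial_size, years, doubling_period_years=2.0):
    """Size after `years` if the size doubles every `doubling_period_years`."""
    return initial_size * 2.0 ** (years / doubling_period_years)


# Hypothetical product starting with 1 MB of code.
for years in (2, 4, 10):
    print(years, projected_code_size(1.0, years))
```

Under this assumption, the amount of code grows by a factor of 32 within ten years.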

Embedded systems form the basis of the so-called post-PC era, in which infor- 
mation processing is more and more moving away from just PCs to embedded 
systems. 

The growing number of applications results in the need for design technologies 
supporting the design of embedded systems. Currently available technologies 
and tools still have important limitations. For example, there is still a need for 
better specification languages, tools generating implementations from specifications,
timing verifiers, real-time operating systems, low-power design techniques,
and design techniques for dependable systems. This book should help in
teaching the essential issues and should be a stepping stone for starting more
research in the area.

1.4 Structure of this book 

Traditionally, the focus of a number of books on embedded systems is on ex- 
plaining the use of micro-controllers, including their memory, I/O and interrupt 
structure. There are many such books [Ganssle, 1992], [Ball, 1996], [Ball, 
1998], [Barr, 1999], [Ganssle, 2000]. 

Due to the increasing complexity of embedded systems, this focus has to
be extended to include at least the different specification languages, hardware/software
codesign, compiler techniques, scheduling and validation techniques.
In the current book, we will be covering all these areas. The goal is to
provide students with an introduction to embedded systems, enabling students 
to put the different areas into perspective. 

For further details, we recommend a number of sources (some of which have 
also been used in preparing this book): 



³According to personal communication.
■ There is a large number of sources of information on specification languages.
These include earlier books by Young [Young, 1982], Burns and
Wellings [Burns and Wellings, 1990] and Berge [Berge et al., 1995]. There
is a huge amount of information on new languages such as SystemC [Muller
et al., 2003], SpecC [Gajski et al., 2000], Java etc.

■ Approaches for designing and using real-time operating systems (RTOSes) 
are presented in a book by Kopetz [Kopetz, 1997]. 

■ Real-time scheduling is covered comprehensively in the books by Buttazzo 
[Buttazzo, 2002] and by Krishna and Shin [Krishna and Shin, 1997]. 

■ Lecture manuscripts of Rajiv Gupta [Gupta, 1998] provide a survey of em- 
bedded systems. 

■ Robotics is an area that is closely linked with embedded systems. We recommend
the book by Fu, Gonzalez and Lee [Fu et al., 1987] for information
on robotics.

■ Additional information is provided by the ARTIST roadmap [Bouyssounouse 
and Sifakis, 2005] and a book by Vahid [Vahid, 2002]. 

The structure of this book corresponds to that of a simplified design informa- 
tion flow for embedded systems, shown in figure 1.5. 




Figure 1.5. Simplified design information flow 

The design information flow starts with ideas in people’s heads. These ideas 
must be captured in a design specification. In addition, standard hardware and 
software components are typically available and should be reused whenever 
possible. 

Design activities start from the specification. Typically, they involve a con- 
sideration of both hardware and software, since both have to be taken into 
account for embedded system design. Design activities comprise mapping
operations to concurrent tasks, high-level transformations (such as advanced 
loop transformations), mapping of operations to either hardware or software 
(called hardware/software partitioning), design space exploration, compilation, 
and scheduling. It may be necessary to design special purpose hardware or to 
optimize processor architectures for a given application. However, hardware 
design is not covered in this book. Standard compilers can be used for the 
compilation. However, they are frequently not optimized for embedded pro- 
cessors. Therefore, we will also briefly cover compiler techniques that should 
be applied in order to obtain the required efficiency. Once binary code has 
been obtained for each task, it can be scheduled precisely. Final software and 
hardware descriptions can be merged, resulting in a complete description of 
the design and providing input for fabrication. 

At the current state of the art, none of the design steps can be guaranteed to be 
correct. Therefore, it is necessary to validate the design. Validation consists of 
checking intermediate or final design descriptions against other descriptions. 
Evaluation is another activity that is required during various phases of the de- 
sign. Various properties can be evaluated, including performance, dependabil- 
ity, energy consumption, manufacturability etc. 

Note that fig. 1.5 represents the flow of information about the design object. 
The sequence of design activities has to be consistent with that flow. This 
does not mean, however, that design activities correspond to a simple path 
from ideas to the final product. In practice, some design activities have to be 
repeated. For example, it may become necessary to return to the specification 
or to obtain additional application knowledge. It may also become necessary 
to consider additional standard operating systems if the initially considered 
operating system cannot be used for performance reasons. 

Consistent with the design information flow, this book is structured as fol- 
lows: in chapter 2, we will discuss specification languages. Key hardware 
components of embedded systems will be presented in chapter 3. Chapter 4 is 
devoted towards the description of real-time operating systems, other types of 
such middleware, and standard scheduling techniques. Standard design tech- 
niques for implementing embedded systems - including compilation issues - 
will be discussed in chapter 5. Finally, validation will be covered in the last 
chapter. 




Chapter 2 



SPECIFICATIONS 



2.1 Requirements 

Consistent with the simplified design flow (see fig. 1.5), we will now describe
requirements and approaches for specifying embedded systems.

There may still be cases in which the specification of embedded systems is 
captured in a natural language, such as English. However, this approach is
absolutely inappropriate since it fails to meet key requirements for specification
techniques: it must be possible to check specifications for completeness and
for the absence of contradictions, and it should be possible to derive
implementations from the specification in a systematic way. Therefore,
specifications should be captured in machine readable, formal languages.
Specification languages for embedded systems should be capable of representing
the following features¹:



■ Hierarchy: Human beings are generally not capable of comprehending 
systems that contain many objects (states, components) having complex 
relations with each other. The description of all real-life systems needs 
more objects than human beings can understand. Hierarchy is the only 
mechanism that helps to solve this dilemma. Hierarchies can be introduced 
such that humans need to handle only a small number of objects at any 
time. 

There are two kinds of hierarchies: 



¹Information from the books of Burns et al. [Burns and Wellings, 1990], Berge et al. [Berge et al., 1995]
and Gajski et al. [Gajski et al., 1994] is used in this list.
- Behavioral hierarchies: Behavioral hierarchies are hierarchies containing
objects necessary to describe the system behavior. States, events
and output signals are examples of such objects. 

- Structural hierarchies: Structural hierarchies describe how systems 
are composed of physical components. 

For example, embedded systems can be composed of processors, memories,
actuators and sensors. Processors, in turn, include registers, multiplexers
and adders. Multiplexers are composed of gates.

■ Timing-behavior: Since explicit timing requirements are one of the char- 
acteristics of embedded systems, timing requirements must be captured in 
the specification. 

■ State-oriented behavior: It was already mentioned in chapter 1 that au- 
tomata provide a good mechanism for modeling reactive systems. There- 
fore, the state-oriented behavior provided by automata should be easy to 
describe. However, classical automata models are insufficient, since they 
cannot model timing and since hierarchy is not supported. 

■ Event-handling: Due to the reactive nature of embedded systems, mecha- 
nisms for describing events must exist. Such events may be external events 
(caused by the environment) or internal events (caused by components of 
the system). 

■ No obstacles to the generation of efficient implementations: Since em- 
bedded systems have to be efficient, no obstacles prohibiting the generation 
of efficient realizations should be present in the specification. 

■ Support for the design of dependable systems: Specification techniques 
should provide support for designing dependable systems. For example, 
specification languages should have unambiguous semantics, facilitate for- 
mal verification and be capable of describing security and safety require- 
ments. 

■ Exception-oriented behavior: In many practical systems, exceptions do
occur. In order to design dependable systems, it must be possible to
describe actions to handle exceptions easily. It is not acceptable that
exceptions have to be indicated for each and every state (like in the case of
classical state diagrams). Example: In fig. 2.1, input k might correspond to an
exception.

Specifying this exception at each state makes the diagram very complex.
The situation would get worse for larger state diagrams with many
transitions. We will later show how all the transitions can be replaced by a
single one.

Figure 2.1. State diagram with exception k

■ Concurrency: Real-life systems are distributed, concurrent systems. It is
therefore necessary to be able to specify concurrency conveniently.

■ Synchronization and communication: Concurrent actions have to be
able to communicate and it must be possible to agree on the use of
resources. For example, it is necessary to express mutual exclusion.

■ Presence of programming elements: Usual programming languages have
proven to be a convenient means of expressing computations that have to 
be performed. Hence, programming language elements should be available 
in the specification technique used. Classical state diagrams do not meet 
this requirement. 

■ Executability: Specifications are not automatically consistent with the 
ideas in people’s heads. Executing the specification is a means of plausi- 
bility checking. Specifications using programming languages have a clear 
advantage in this context. 

■ Support for the design of large systems: There is a trend towards large 
and complex embedded software programs. Software technology has found 
mechanisms for designing such large systems. For example, object-orien- 
tation is one such mechanism. It should be available in the specification 
methodology. 

■ Domain-specific support: It would of course be nice if the same speci- 
fication technique could be applied to all the different types of embedded 
systems, since this would minimize the effort for developing specification 
techniques and tool support. However, due to the wide range of application 
domains, there is little hope that one language can be used to efficiently 
represent specifications in all domains. For example, control-dominated, 
data-dominated, centralized and distributed applications-domains can all 
benefit from language features dedicated towards those domains. 

■ Readability: Of course, specifications have to be readable by human
beings. Preferably, they should also be machine-readable in order to process
them in a computer.

■ Portability and flexibility: Specifications should be independent of
specific hardware platforms so that they can be easily used for a variety of
target platforms. They should be flexible such that small changes to the
system’s features require only small changes to the specification.

■ Termination: It should be feasible to identify processes that will terminate 
from the specification. 

■ Support for non-standard I/O-devices: Many embedded systems use 
I/O-devices other than those typically found on a PC. It should be possi- 
ble to describe inputs and outputs for those devices conveniently. 

■ Non-functional properties: Actual systems have to exhibit a number of 
non-functional properties, such as fault-tolerance, size, extendibility, ex- 
pected lifetime, power consumption, weight, disposability, user friendli- 
ness, electromagnetic compatibility (EMC) etc. There is no hope that all 
these properties can be defined in a formal way. 

■ Appropriate model of computation: In order to describe computations, 
computational models are required. Such models will be described in the 
next section. 

From the list of requirements, it is already obvious that there will not be any 
formal language capable of meeting all these requirements. Therefore, in prac- 
tice, we have to live with compromises. The choice of the language used for 
an actual design will depend on the application domain and the environment 
in which the design has to be performed. In the following, we will present a 
survey of languages that can be used for actual designs. 

2.2 Models of computation 

Applications of information technology have so far very much relied on the 
von Neumann paradigm of sequential computing. This paradigm is not appro- 
priate for embedded systems, in particular those with real-time requirements, 
since there is no notion of time in von Neumann computing. Other models of 
computation are more adequate. Models of computation can be described as 
follows [Lee, 1999]: 

■ Models of computation define components. Procedures, processes, func- 
tions, finite state machines are possible components. 



■ Models of computation define communication protocols. These protocols
constrain the mechanism by which components can interact. Asynchronous
message passing and rendez-vous based communication are examples of
communication protocols.

■ Models of computation possibly also define what components know about 
each other (the information which components share). For example, they 
might share information about global variables. 

Models of computation do not define the vocabulary of the interaction of the
components.

Examples of models of computation capable of describing concurrency include
the following [Lee, 1999], [Janka, 2002], [Jantsch, 2003]:

■ Communicating finite state machines (CFSMs): CFSMs are collections 
of finite state machines communicating with each other. Methods for com- 
munication must be defined. This model of computation is used, for ex- 
ample, for StateCharts (see next section), the StateChart variant StateFlow, 
and SDL (see page 30). 

■ Discrete event model: In this model, there are events carrying a totally 
ordered time stamp, indicating the time at which the event occurs. Discrete 
event simulators typically contain a global event queue sorted by time. The 
disadvantage is that it relies on a global notion of one or more event queues, 
making it difficult to map the semantic model onto specific implementations.
Examples include VHDL (see page 59), Verilog (see page 75), and
Simulink from MathWorks (see page 79).

■ Differential equations: Differential equations are capable of modeling analog
circuits and physical systems. Hence, they can find applications in embedded
system modeling.

■ Asynchronous message passing: In asynchronous message passing, pro- 
cesses communicate by sending messages through channels that can buffer 
the messages. The sender does not need to wait for the receiver to be ready 
to receive the message. In real life, this corresponds to sending a letter. A 
potential problem is the fact that messages may have to be stored and that 
message buffers can overflow. 

There are several variations of this scheme, including Kahn process net- 
works (see page 53) and dataflow models. 

A dataflow program is specified by a directed graph where the nodes (ver- 
tices), called “actors”, represent computations and the arcs represent first- 
in first-out (FIFO) channels. The computation performed by each actor is 
assumed to be functional, that is, based on the input values only. Each pro- 
cess in a dataflow graph is decomposed into a sequence of firings, which 
are atomic actions. Each firing produces and consumes tokens. 
Of particular interest is synchronous dataflow (SDF), which requires
processes to consume and produce a fixed number of tokens each firing. SDF
can be statically scheduled, which makes implementations very efficient.

■ Synchronous message passing: In synchronous message passing, the
components are processes. They communicate in atomic, instantaneous actions
called rendez-vous. The process reaching the point of communication first
has to wait until the partner has also reached its point of communication.
There is no risk of overflows, but the performance may suffer.

Examples of languages following this model of computation include CSP
(see page 55) and ADA (see page 55).
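The risk of buffer overflow mentioned for asynchronous message passing can be made concrete with a small sketch. It uses the queue module of Python's standard library as the buffering channel; the capacity of two messages and the names channel, send and receive are assumptions chosen for illustration only.

```python
import queue

channel = queue.Queue(maxsize=2)   # FIFO channel with room for two messages


def send(message):
    """Non-blocking send: returns False if the buffer is already full."""
    try:
        channel.put_nowait(message)
        return True
    except queue.Full:
        return False


def receive():
    """Take the oldest buffered message, or None if the channel is empty."""
    try:
        return channel.get_nowait()
    except queue.Empty:
        return None
```

The sender never waits for the receiver. In this sketch, a third message sent before anything has been received is simply rejected, which corresponds to the overflow problem described above.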

Different applications may require the use of different models. While some of
the actual languages implement only one of the models, others allow a mix of
models.
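To illustrate why fixed token rates allow static scheduling of SDF graphs, the following sketch computes how often each actor has to fire in one period of a schedule. It solves the so-called balance equations, a standard first step that is not taken from the text above. The function name repetition_vector and the edge format (source, destination, tokens produced, tokens consumed) are assumptions; the graph is assumed to be connected, and Python 3.9 or later is needed for math.lcm.

```python
from fractions import Fraction
from math import gcd, lcm


def repetition_vector(actors, edges):
    """Solve the balance equations r[src] * produced == r[dst] * consumed.

    edges is a list of (src, dst, produced, consumed) tuples describing a
    connected graph. Returns the smallest positive integer solution, or
    None if the token rates are inconsistent.
    """
    rate = {actors[0]: Fraction(1)}
    pending = [actors[0]]
    while pending:
        current = pending.pop()
        for src, dst, produced, consumed in edges:
            if src == current:
                other, value = dst, rate[src] * produced / consumed
            elif dst == current:
                other, value = src, rate[dst] * consumed / produced
            else:
                continue
            if other not in rate:
                rate[other] = value
                pending.append(other)
            elif rate[other] != value:
                return None  # inconsistent rates: no periodic schedule
    # Scale the rational solution to the smallest integer solution.
    scale = lcm(*(r.denominator for r in rate.values()))
    counts = {actor: int(r * scale) for actor, r in rate.items()}
    divisor = gcd(*counts.values())
    return {actor: count // divisor for actor, count in counts.items()}
```

For a hypothetical chain in which A produces 2 tokens per firing, B consumes 3 and produces 1, and C consumes 2, the call repetition_vector(["A", "B", "C"], [("A", "B", 2, 3), ("B", "C", 1, 2)]) yields 3 firings of A, 2 of B and 1 of C per period. Since these numbers are known before run-time, the firing order can be fixed in advance.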

2.3 StateCharts 

The first actual language which will be presented is StateCharts. StateCharts
was introduced in 1987 by David Harel [Harel, 1987] and later described more
precisely [Drusinsky and Harel, 1989]. StateCharts describes communicating
finite state machines. It is based on the shared memory concept of
communication. According to Harel, the name was chosen since it was the only unused
combination of “flow” or “state” with “diagram” or “chart”.

We mentioned in the previous section that we need to describe state-oriented
behavior. State diagrams are a classical method of doing this. Fig. 2.2 (the
same as fig. 2.1) shows an example of a classical state diagram, representing a
finite state machine (FSM).




Figure 2.2. State diagram 

Circles denote states. At any time, deterministic FSMs, which we will consider
here, can only be in one of their states. Edges denote state transitions. Edge
labels represent events. If an event happens, the FSM will change its state
as indicated by the edge. FSMs may also generate output (not shown in fig.
2.2). For more information about classical FSMs refer to, for example, Kohavi
[Kohavi, 1987].
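A classical FSM of this kind can be written down as a transition table. The following Python sketch is only an illustration: since fig. 2.2 is not reproduced here, the edge labels f to j are assumptions, whereas the exception k (leading from each of the states A to E to Z) and input m (leading from Z to A) follow the description of this machine in the text. The names TRANSITIONS, step and run are hypothetical.

```python
# Flat FSM as a transition table: (current state, event) -> next state.
# The edges labelled f..j are assumed for illustration only.
TRANSITIONS = {
    ("A", "f"): "B",
    ("B", "g"): "C",
    ("C", "h"): "D",
    ("D", "i"): "E",
    ("E", "j"): "Z",
    ("Z", "m"): "A",
}
# In a flat diagram the exception k needs one edge per state.
for state in "ABCDE":
    TRANSITIONS[(state, "k")] = "Z"


def step(state, event):
    """Return the next state; events without a matching edge are ignored."""
    return TRANSITIONS.get((state, event), state)


def run(state, events):
    """Apply a sequence of events, starting from the given state."""
    for event in events:
        state = step(state, event)
    return state
```

Note that the flat table needs five separate entries for k, one for each of the states A to E.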

2.3.1 Modeling of hierarchy

StateCharts describe extended FSMs. Due to this, they can be used for mod- 
eling state-oriented behavior. The key extension is hierarchy. Hierarchy is 
introduced by means of super-states. 

Definitions: 

■ States comprising other states are called super-states. 

■ States included in super-states are called sub-states of the super-states. 

Fig. 2.3 shows a StateCharts example. It is a hierarchical version of fig. 2.2. 




Figure 2.3. Hierarchical state diagram 

Super-state S includes states A, B, C, D and E. Suppose the FSM is in state Z 
(we will also say that Z is an active state). Now, if input m is applied to the
FSM, then A will be the new state. If the FSM is in S and input k is applied, 
then Z will be the new state, regardless of whether the FSM is in sub-states A, 
B, C, D or E of S. In this example, all states contained in S are non-hierarchical 
states. In general, sub-states of S could again be super-states consisting of 
sub-states themselves. 

Definitions: 

■ Each state which is not composed of other states is called a basic state. 

■ For each basic state s, the super-states containing s are called ancestor
states.

The FSM of fig. 2.3 can only be in one of the sub-states of super-state S at any
time. Super-states of this type are called OR-super-states².



²More precisely, they should be called XOR-super-states, since the FSM is in either A, B, C, D or E. However,
this name is not commonly used in the literature.

In fig. 2.3, k might correspond to an exception for which state S has to be
left. The example already shows that the hierarchy introduced in StateCharts 
enables a compact representation of exceptions. 
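The compact treatment of exception k can be mimicked in a few lines of Python. The sketch below is an illustration under assumptions, not an implementation of StateCharts: the edge labels f to j are invented, since fig. 2.3 is not reproduced here, and the names PARENT, TRANSITIONS and step are hypothetical. A transition attached to super-state S is found by searching the active basic state first and then its ancestor states.

```python
# Hierarchical lookup: a transition defined on a super-state applies to all
# of its sub-states. Edge labels f..j are assumptions; m (Z to A) and
# k (S to Z) follow the text.
PARENT = {"A": "S", "B": "S", "C": "S", "D": "S", "E": "S"}  # Z is top level

TRANSITIONS = {
    ("A", "f"): "B",
    ("B", "g"): "C",
    ("C", "h"): "D",
    ("D", "i"): "E",
    ("E", "j"): "Z",
    ("Z", "m"): "A",
    ("S", "k"): "Z",   # a single edge handles k for all sub-states of S
}


def step(state, event):
    """Search the active basic state first, then its ancestor states."""
    scope = state
    while scope is not None:
        target = TRANSITIONS.get((scope, event))
        if target is not None:
            return target
        scope = PARENT.get(scope)
    return state  # no edge found: stay in the current state
```

A single table entry for k is sufficient here, regardless of how many sub-states S contains.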

StateCharts allows hierarchical descriptions of systems in which a system de- 
scription comprises descriptions of subsystems which, in turn, may contain de- 
scriptions of subsystems. The description of the entire system can be thought 
of as a tree. The root of the tree corresponds to the system as a whole, and 
all inner nodes correspond to hierarchical descriptions (in the case of State- 
Charts called super-nodes). The leaves of the hierarchy are non-hierarchical 
descriptions (in the case of StateCharts called basic states). 

So far, we have used explicit, direct edges to basic states to indicate the next 
state. The disadvantage of that approach is that the internal structure of super- 
states cannot be hidden from the environment. However, in a true hierarchical 
environment, we should be able to hide the internal structure so that it can 
be described later or changed later without affecting the environment. This is 
possible with other mechanisms for describing the next state. 

The first additional mechanism is the default state mechanism. It can be used 
in super-states to indicate the particular sub-states that will be entered if the 
super-states are entered. In diagrams, default states are identified by edges 
starting at small filled circles. Fig. 2.4 shows a state diagram using the default 
state mechanism. It is equivalent to the diagram in fig. 2.3. Note that the filled 
circle does not constitute a state itself. 




Figure 2.4. State diagram using the default state mechanism

Another mechanism for specifying next states is the history mechanism. With 
this mechanism, it is possible to return to the last sub-state that was active 
before a super-state was left. The history mechanism is symbolized by a circle 
containing the letter H. In order to define the next state for the very initial 
transition into a super-state, the history mechanism is frequently combined 
with the default mechanism. Fig. 2.5 shows an example. 

Figure 2.5. State diagram using the history and the default state mechanism

The behavior of the FSM is now somewhat different. If we input m while the
system is in Z, then the FSM will enter A if this is the very first time we enter
S, and otherwise it will enter the last state that we were in. This mechanism
has many applications. For example, if k denotes an exception, we could use
input m to return to the state we were in before the exception. States A, B, C, D
and E could also call Z like a procedure. After completing “procedure” Z, we
would return to the calling state.

Fig. 2.5 can also be redrawn as shown in fig. 2.6. In this case, the symbols for
the default and the history mechanism are combined.




Figure 2.6. Combining the symbols for the history and the default state mechanism
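The interplay of the history and the default state mechanism can be sketched as follows. The class SuperStateS and its methods are hypothetical names introduced for this illustration; the sketch only records which sub-state is entered and is not meant as an implementation of StateCharts.

```python
class SuperStateS:
    """OR-super-state with a default sub-state and a history entry."""

    DEFAULT = "A"

    def __init__(self):
        self.history = None   # last active sub-state, None before first entry
        self.active = None    # active sub-state, None while S is not active

    def enter(self):
        """Enter via the history symbol; use the default on the first entry."""
        self.active = self.history if self.history is not None else self.DEFAULT
        return self.active

    def move_to(self, sub_state):
        """Internal transition between sub-states (only valid while in S)."""
        if self.active is None:
            raise RuntimeError("S is not active")
        self.active = sub_state

    def leave(self):
        """Leave S (for example on exception k) and remember the sub-state."""
        self.history = self.active
        self.active = None
```

On the very first call, enter() selects the default state A. After a sequence such as move_to("C") followed by leave(), the next call of enter() resumes in C.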

Specification techniques must also be able to describe concurrency
conveniently. Towards this end, StateCharts provides a second class of super-states,
so-called AND-states.

Definition: Super-states S are called AND-super-states if the system
containing S will be in all of the sub-states of S whenever it is in S.

An AND-super-state is included in the answering machine shown in fig. 2.7. 

An answering machine normally performs two tasks concurrently: it is
monitoring the line for incoming calls and the keys for user input. In fig. 2.7, the
corresponding states are called Lwait and Kwait. Incoming calls are processed
in state Lproc while the response to pressed keys is generated in state Kproc.
For the time being, we assume that the on/off switch (generating events key-off
and key-on) is decoded separately and pushing it does not result in entering
Kproc. If this switch is pushed, the line monitoring state as well as the key
monitoring state are left and re-entered only if the machine is switched on.
At that time, default states Lwait and Kwait are entered. While switched on,
the machine will always be in the line monitoring state as well as in the key
monitoring state.

For AND-super-states, the sub-states entered as a result of some event can be 
defined independently. There can be any combination of history, default and 
explicit transitions. It is crucial to understand that all sub-states will always be 
entered, even if there is just one explicit transition to one of the sub-states. Ac- 
cordingly, transitions out of an AND-super-state will always result in leaving 
all the sub-states. 
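The rule that all sub-states of an AND-super-state are entered and left together can be illustrated with a small Python sketch of the answering machine. State names and the events key-on and key-off are taken from the text; the events ring, hangup, key-pressed and done as well as the names DEFAULTS, REGION_TRANSITIONS and step are assumptions made for this illustration.

```python
# Sketch of an AND-super-state with two concurrently active sub-states.
# Event names other than key-on and key-off are assumptions.
DEFAULTS = {"line": "Lwait", "keys": "Kwait"}

REGION_TRANSITIONS = {
    "line": {("Lwait", "ring"): "Lproc", ("Lproc", "hangup"): "Lwait"},
    "keys": {("Kwait", "key-pressed"): "Kproc", ("Kproc", "done"): "Kwait"},
}


def step(config, event):
    """config is None while the machine is off, else a dict region -> state."""
    if config is None:
        # Entering the AND-super-state enters every region (default states).
        return dict(DEFAULTS) if event == "key-on" else None
    if event == "key-off":
        # Leaving the AND-super-state leaves all regions at once.
        return None
    # Otherwise each region reacts to the event independently.
    return {
        region: REGION_TRANSITIONS[region].get((sub, event), sub)
        for region, sub in config.items()
    }
```

While the machine is switched on, the configuration always contains one state for line monitoring and one for key monitoring. The event key-off removes both at once.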

For example, let us modify our answering machine such that the on/off switch, 
like all other switches, is decoded in state Kproc (see fig. 2.8). 




Figure 2.8. Answering machine with modified on/off switch processing

If pushing that key is detected in Kproc, a transition is made to the off state.
This transition results in leaving the line-monitoring state as well. Switching 
the machine on again results in also entering the line-monitoring state. 

Summarizing, we can state the following: States in StateCharts diagrams 
are either AND-states, OR-states or basic states. 

2.3.2 Timers 

Due to the requirement to model time in embedded systems, StateCharts also 
provides timers. Timers are denoted by the symbol shown in fig. 2.9 (left). 




Figure 2.9. Timer in StateCharts

After the system has been in the state containing the timer for the specified
period, a time-out will occur and the system will leave the specified state. Timers
can be used hierarchically.

Timers can be used, for example, at the next lower level of the hierarchy of the
answering machine in order to describe the behavior of state Lproc. Fig. 2.10
shows a possible behavior for that state.




Figure 2.10. Servicing the incoming line in Lproc

Due to the exception-like transition for hangups by the caller in fig. 2.7, state
Lproc is terminated whenever the caller hangs up. For hangups (returns) by the
callee, the design of state Lproc results in an inconvenience: If the callee hangs
up the phone first, the telephone will be dead (and quiet) until the caller has
also hung up the phone.

StateCharts do include a number of other language elements. For a full
description refer to Harel [Harel, 1987]. A more detailed description of the semantics
of the StateMate implementation of StateCharts is given by Drusinsky and
Harel [Drusinsky and Harel, 1989].

2.3.3 Edge labels and StateCharts semantics 

Until now, we have not considered outputs generated by our extended FSMs. 
Generated outputs can be specified using edge labels. The general form of an 
edge label is “event [condition] / reaction”. All three label parts are optional. 
The reaction-part describes the reaction of the FSM to a state transition.
Possible reactions include the generation of events and assignments to variables.
The condition-part implies a test of the values of variables or a test of the cur- 
rent state of the system. The event-part refers to a test of current events. Events 
can be generated either internally or externally. Internal events are generated as 
a result of some transition and are described in reaction-parts. External events 
are usually described in the model environment. 

Examples: 

■ on-key / on:=1 (Event-test and variable assignment), 

■ [on=1] (Condition test for a variable value), 

■ off-key [not in Lproc] / on:=0 (Event-test, condition test for a state, variable 
assignment. The assignment is performed if the event has occurred and the 
condition is true). 
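The general form “event [condition] / reaction” is regular enough to be split mechanically. The following Python sketch shows one possible way of doing this with a regular expression. The pattern LABEL and the function parse_label are hypothetical and only cover the simple label syntax used in the examples above.

```python
import re

# General form: "event [condition] / reaction"; all three parts are optional.
LABEL = re.compile(
    r"^\s*(?P<event>[^\[/]*?)\s*"
    r"(?:\[(?P<condition>[^\]]*)\])?\s*"
    r"(?:/\s*(?P<reaction>.*?))?\s*$"
)


def parse_label(label):
    """Split an edge label into its (event, condition, reaction) parts."""
    match = LABEL.match(label)
    if match is None:
        raise ValueError(f"malformed edge label: {label!r}")
    # Missing or empty parts are reported as None.
    return tuple((match.group(name) or "").strip() or None
                 for name in ("event", "condition", "reaction"))
```

Applied to the third example, parse_label("off-key [not in Lproc] / on:=0") returns the tuple ("off-key", "not in Lproc", "on:=0"). Parts that are not present in a label are returned as None.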

The semantics of edge labels can only be explained in the context of the se- 
mantics of StateCharts. According to the semantics of the StateMate imple- 
mentation of StateCharts [Drusinsky and Harel, 1989], a step-based execution 
of StateCharts-descriptions is assumed. Each step consists of three phases: 

1 In the first phase, the effect of external changes on conditions and events 
is evaluated. This includes the evaluation of functions which depend on 
external events. This phase does not include any state changes. In our 
simple examples, this phase is not actually needed. 

2 The next phase is to calculate the set of transitions that should be made in 
the current step. Variable assignments are evaluated, but the new values are 
only assigned to temporary variables. 

3 In the third phase, state transitions become effective and variables obtain
their new values.

The separation into phases 2 and 3 is especially important in order to guarantee
a deterministic and reproducible behavior of StateCharts models. Consider the 
StateCharts model of fig. 2.11. 




Figure 2.11. Mutually dependent assignments 

Due to the separation into two phases, new values for a and b are stored in
temporary variables, say a’ and b’. In the final phase, temporary variables are
copied into the user-defined variables:

phase 2: a’:=b; b’:=a;
phase 3: a:=a’; b:=b’

As a result, the two variables will be swapped each time an event e happens.
This behavior corresponds to that of two cross-coupled registers (one for each
variable) connected to the same clock (see fig. 2.12) and reflects the operation
of a clocked finite state machine including those two registers.




Figure 2.12. Cross-coupled registers 

Without the separation into two phases, the result would depend on the
sequence in which the assignments are performed. In any case, the same value
would be assigned to both variables. This separation into (at least) two phases
is quite typical for languages that try to reflect the operation of synchronous
hardware. We will find the same separation in VHDL (see page 73).
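The difference between the two evaluation strategies can be reproduced with a few lines of Python. In this sketch, an assignment is represented as a pair (target, source); this representation and the function names step_two_phase and step_sequential are assumptions made for illustration.

```python
def step_two_phase(variables, assignments):
    """Phase 2: evaluate all right-hand sides using the old values.
    Phase 3: copy the temporaries into the user-defined variables."""
    temporaries = {target: variables[source] for target, source in assignments}
    updated = dict(variables)
    updated.update(temporaries)
    return updated


def step_sequential(variables, assignments):
    """Without separate phases, each assignment sees earlier results."""
    updated = dict(variables)
    for target, source in assignments:
        updated[target] = updated[source]
    return updated
```

With a = 1 and b = 2, the assignments a:=b and b:=a swap the values in the two-phase version. The sequential version yields a = b = 2 if a:=b is performed first and a = b = 1 otherwise, so the result depends on the order and both variables end up with the same value.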

The three phases are assumed to be executed for each step. Steps are assumed
to be executed each time events or variables have changed. The execution of
a StateChart model consists of the execution of a sequence of steps (see fig.
2.13), each step consisting of three phases.

Figure 2.13. Steps during the execution of a StateCharts model



The set of all values of variables, together with the set of events generated 
(and the current time) is defined as the status³ of a StateCharts model. After
executing the third phase, a new status is obtained. The notion of steps allows 
us to more precisely define the semantics of events. Events are generated, as 
mentioned, either internally or externally. The visibility of events is limited 
to the step following the one in which they are generated. Thus, events 
behave like single bit values which are stored in permanently enabled registers 
at one clock transition and have an effect on the values stored at the next clock 
transition. They do not live forever. 
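The limited lifetime of events can be illustrated with the following simplified sketch. It ignores states, conditions and variables and only tracks which events are visible in which step. The function run and the table reactions, mapping a visible event to the events generated in reaction to it, are assumptions made for this illustration.

```python
def run(reactions, initial_events, steps):
    """Return the set of visible events for each step.

    reactions maps an event name to the events generated in reaction to it.
    Events generated in step n are visible in step n+1 only.
    """
    visible = set(initial_events)
    trace = []
    for _ in range(steps):
        trace.append(visible)
        generated = set()
        for event in visible:
            generated.update(reactions.get(event, ()))
        visible = generated   # old events are dropped: they do not live forever
    return trace
```

If e1 triggers e2 and e2 triggers e3, then starting with e1 the visible events in four consecutive steps are e1, e2, e3 and finally none at all.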

Variables, in contrast, retain their values, until they are reassigned. According 
to StateCharts semantics, new values of variables are visible to all parts of the 
model from the step following the step in which the assignment was made on- 
wards. That means, StateCharts semantics implies that new values of variables 
are propagated to all parts of a model between two steps. StateCharts implicitly 
assumes a broadcast mechanism for updates on variables. For distributed 
systems, it will be very difficult to update all variables between two steps. Due 
to this broadcast mechanism, StateCharts is not an appropriate language for 
modeling distributed systems. 

2.3.4 Evaluation and extensions 

StateCharts’ main application domain is that of local, control-dominated sys- 
tems. The capability of nesting hierarchies at arbitrary levels, with a free choice 
of AND- and OR-states, is a key advantage of StateCharts. Another advantage 
is that the semantics of StateCharts is defined at a sufficient level of detail
[Drusinsky and Harel, 1989]. Furthermore, there are quite a number of
commercial tools based on StateCharts. StateMate (see http://www.ilogix.com),
StateFlow (see http://www.mathworks.com/products/stateflow) and BetterState
(see http://www.windriver.com/products/betterstate/index.html) are examples
of commercial tools based on StateCharts. Many of these are capable of
translating StateCharts into equivalent descriptions in C or VHDL (see page 59).
From VHDL, hardware can be generated using synthesis tools. Therefore,
StateCharts-based tools provide a complete path from StateCharts-based
specifications down to hardware. Generated C programs can be compiled and
executed. Hence, a path to software-based realizations also exists.

³We would normally use the term “state” instead of “status”. However, the term “state” has a different
meaning in StateCharts.

Unfortunately, the efficiency of the automatic translation is frequently a con- 
cern. For example, sub-states of AND-states may be mapped to UNIX-pro- 
cesses. This concept is acceptable for simulating StateCharts, but will hardly 
lead to efficient implementations on small processors. The productivity gain 
from object-oriented programming is not available in StateCharts, since it is 
not object-oriented. Furthermore, the broadcast mechanism makes it less ap- 
propriate for distributed systems. StateCharts do not comprise program con- 
structs for describing complex computation and cannot describe hardware struc- 
tures or non-functional behavior. 

Commercial implementations of StateCharts typically provide some mecha- 
nisms for removing the limitations of StateCharts. For example, C-code can 
be used to represent program constructs and module charts of StateMate can 
represent hardware structures. 

2.4 General language characteristics 

The previous section provides us with some first examples of properties of
specification languages. These examples help us to understand the more general
discussion of language properties in this section, before we discuss more
languages in the sections that follow. There are several characteristics by
which we can compare the properties of languages. The first property is related
to the distinction between deterministic and non-deterministic models already
touched on in our discussion of StateCharts.

2.4.1 Synchronous and asynchronous languages 

A problem which exists for some languages based on general communicating 
finite state machines and sets of processes described in ADA or Java is that 
they are non-deterministic, since the order in which executable processes are 
executed is not specified. The order of execution may well affect the result.
This effect can have a number of negative consequences, such as, for example,
problems with verifying a certain design. The non-determinism is avoided
with so-called synchronous languages. Synchronous languages describe
concurrently operating automata. “... when automata are composed in parallel,
a transition of the product is made of the ‘simultaneous’ transitions of all
of them” [Halbwachs, 1998]. This means: we do not have to consider all the
different sequences of state changes of the automata that would be possible
if each of them had its own clock. Instead, we can assume the presence of a
single global clock. Each clock tick, all inputs are considered, new outputs
and states are calculated and then the transitions are made. This requires a fast
broadcast mechanism for all parts of the model. This idealistic view of
concurrency has the advantage of guaranteeing deterministic behavior. This is a
restriction if compared to the general CFSM model, in which each FSM can
have its own clock. Synchronous languages reflect the principles of operation
in synchronous hardware and also the semantics found in control languages
such as IEC 60848 and STEP 7 (see page 78). Guaranteeing a deterministic
behavior for all language features has been a design goal for the synchronous
languages Esterel (see page 79) [Esterel, 2002] and Lustre [Halbwachs et al.,
1991]. Due to the three simulation phases in StateCharts, StateCharts is also
a synchronous language (and it is deterministic). Just like StateCharts,
synchronous languages are difficult to use in a distributed environment, where the
concept of a single clock results in difficulties.

2.4.2 Process concepts 

There are several criteria by which we can compare the process concepts in 
different programming languages: 

■ The number of processes can be either static or dynamic. A static number
of processes simplifies the implementation and is sufficient if each process
models a piece of hardware and if we do not consider “hot-plugging”
(dynamically changing the hardware architecture). Otherwise, dynamic
process creation (and death) should be supported.

■ Processes can either be statically nested or all declared at the same level.
For example, StateCharts allows nested process declarations while SDL
(see page 30) does not. Nesting provides encapsulation of concerns.

■ Different techniques for process creation exist. Process creation can result 
from an elaboration of the process declaration in the source code, through 
the fork and join mechanism (supported for example in Unix), and also 
through explicit process creation calls. 

StateCharts is limited to a static number of processes. Processes can be nested. 
Process creation results from an elaboration of the source code. 

2.4.3 Synchronization and communication 

There are essentially two communication paradigms: shared memory and 
message passing. 

For shared memory, all variables can be accessed from all processes. Access
to shared memory should be protected, unless access is totally restricted to
reads. If writes are involved, exclusive access to the memory must be
guaranteed while processes are accessing shared memories. Segments of code, for
which exclusive access must be guaranteed, are called critical sections. Sev- 
eral mechanisms for guaranteeing exclusive access to resources have been pro- 
posed. These include semaphores, conditional critical regions and monitors. 
Refer to books on operating systems for a description of the different tech- 
niques. Shared memory-based communication can be very fast, but is difficult 
to implement in multiprocessor systems if no common memory is physically 
available. 
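A critical section protected by a lock can be sketched as follows. The example uses threading.Lock from Python's standard library, which plays a role similar to a binary semaphore. It is one possible illustration of exclusive access and not one of the specific mechanisms discussed in the operating systems literature; the names counter, worker and run are hypothetical.

```python
import threading

counter = 0                      # shared memory, written by several threads
counter_lock = threading.Lock()  # guards the critical section below


def worker(increments):
    global counter
    for _ in range(increments):
        with counter_lock:       # critical section: read-modify-write
            counter += 1


def run(threads=4, increments=10_000):
    """Start the workers, wait for them, and return the final counter value."""
    global counter
    counter = 0
    pool = [threading.Thread(target=worker, args=(increments,))
            for _ in range(threads)]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()
    return counter
```

Since each update of counter is performed while holding the lock, no increment can be lost, and run() returns the product of the number of threads and the number of increments per thread.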

For message passing, messages are sent and received just like e-mail is sent
on the Internet. Message passing can be implemented easily even if no
common memory is available. However, message passing is generally slower than
shared memory based communication. For this kind of communication, we 
can distinguish between the following three techniques: 

■ asynchronous message passing, also called non-blocking communica- 
tion (see page 17), 

■ synchronous message passing or blocking communication, rendez-vous 
based communication (see page 18), 

■ extended rendez-vous, remote invocation: the transmitter is allowed to 
continue only after an acknowledgment has been received from the receiver. 
The recipient does not have to send this acknowledgment immediately af- 
ter receiving the message, but can do some preliminary checking before 
actually sending the acknowledgment. 
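
The difference between a non-blocking send and a send that waits for an
acknowledgment can be sketched with Python's standard queue module. This is
only an analogy under our own assumptions: the functions receiver and demo
and the use of None as an end marker are hypothetical, and task_done() plays
the role of the acknowledgment.

```python
import queue
import threading

def receiver(mailbox, log):
    while True:
        msg = mailbox.get()          # blocks until a message is available
        if msg is None:              # hypothetical end-of-stream marker
            mailbox.task_done()
            break
        log.append(msg)              # preliminary checking would go here
        mailbox.task_done()          # acknowledgment back to the sender

def demo():
    mailbox, log = queue.Queue(), []
    t = threading.Thread(target=receiver, args=(mailbox, log))
    t.start()
    # Asynchronous (non-blocking) send: put() returns immediately.
    mailbox.put("a")
    mailbox.put("b")
    # Remote-invocation style: send, then wait for the acknowledgment.
    mailbox.put("c")
    mailbox.join()                   # returns once all messages were acknowledged
    acknowledged = list(log)
    mailbox.put(None)
    t.join()
    return acknowledged
```

The sender continues right after the first two put() calls, whereas join()
makes it wait until the receiver has acknowledged everything sent so far.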

StateCharts allows global variables and hence uses the shared memory model. 

2.4.4 Specifying timing 

Burns and Wellings [Burns and Wellings, 1990] define the following four re- 
quirements for specification languages: 

■ Access to a timer, which provides a means to measure elapsed time: 

CSP, for example, meets this requirement by providing channels which are 
actually timers. Read operations to such a channel return the current time. 

■ Means for delaying processes for a specified time:

Typically, real-time languages provide some delay construct. In VHDL, the
wait for-statement (see page 69) can be used.

■ Possibility to specify timeouts:

Real-time languages usually also provide some timeout construct.

■ Methods for specifying deadlines and schedules:

Unfortunately, most languages do not allow timing constraints to be
specified. If they can be specified at all, they have to be specified in
separate control files, pop-up menus etc.

StateCharts allows timeouts. There is no straightforward way of specifying
other timing requirements.
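
The first three requirements can be illustrated outside of any specification
language. The following Python sketch uses helper names of our own choosing
(elapsed_since, delay, receive_with_timeout); it does not address the fourth
requirement, since plain Python offers no construct for deadlines or
schedules either.

```python
import queue
import time

def elapsed_since(start):
    # 1. Access to a timer: a monotonic clock measures elapsed time.
    return time.monotonic() - start

def delay(seconds):
    # 2. Delaying a process for a specified time.
    time.sleep(seconds)

def receive_with_timeout(mailbox, seconds):
    # 3. Timeout: give up waiting after `seconds`; None signals the timeout.
    try:
        return mailbox.get(timeout=seconds)
    except queue.Empty:
        return None
```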

2.4.5 Using non-standard I/O devices 

Some languages include special language features enabling direct control over
I/O devices. For example, ADA allows variables to be mapped to specific
memory addresses. These may be the addresses of memory mapped I/O devices.
This way, all I/O operations can be programmed in ADA. ADA also
allows procedures to be bound to interrupt addresses.

No direct support for I/O is available in standard StateCharts, but commercial
implementations can support I/O programming.

2.5 SDL 

Because of the use of shared memory and the broadcast mechanism, StateCharts
cannot be used for distributed applications. We will now turn our attention
towards a second language, one which is applicable for modeling distributed
systems, namely SDL. SDL was designed for distributed applications
and is based on asynchronous message passing. It dates back to the early
seventies. Formal semantics have been available since the late eighties. The
language was standardized by the ITU (International Telecommunication Union).
The first standards document is the Z.100 Recommendation published in 1980,
with updates in 1984, 1988, 1992 (SDL-92), 1996 and 1999. Relevant versions
of the standard include SDL-88, SDL-92 and SDL-2000 [SDL Forum Society,
2003a].

Many users prefer graphical specification languages while others prefer tex- 
tual specification languages. SDL pleases both types of users since it provides 
textual as well as graphical formats. Processes are the basic elements of SDL. 
Processes represent extended finite state machines. Extensions include oper- 
ations on data. Fig. 2.14 shows the graphical symbols used in the graphical 
representation of SDL. 

As an example, we will consider how the state diagram in fig. 2.15 can be
represented in SDL. Fig. 2.15 is the same as fig. 2.4, except that output has
been added, state Z deleted, and the effect of signal k changed. Fig. 2.16
contains the corresponding graphical SDL representation. Obviously, the
representation is equivalent to the state diagram of fig. 2.15.

Figure 2.14. Symbols used in the graphical form of SDL (symbols shown:
identifies initial state, state, input, output)

Figure 2.15. FSM described in SDL

Figure 2.16. SDL-representation of fig. 2.15

As an extension to FSMs, SDL processes can perform operations on data.
Variables can be declared locally for processes. Their type can either be
pre-defined or defined in the SDL description itself. SDL supports abstract
data types (ADTs). The syntax for declarations and operations is similar to
that in other languages. Fig. 2.17 shows how declarations, assignments and
decisions can be represented in SDL.

SDL also contains programming language elements such as procedures.
Procedure calls can also be represented graphically. Object-oriented features
became available with version SDL-1992 of the language and were extended with
SDL-2000.

Figure 2.17. Declarations, assignments and decisions in SDL

Extended FSMs are just the basic elements of SDL descriptions. In general,
SDL descriptions will consist of a set of interacting processes, or FSMs.
Processes can send signals to other processes. Semantics of interprocess
communication in SDL is based on first-in first-out (FIFO) queues associated
with each process. Signals sent to a particular process will be placed into
the corresponding FIFO-queue (see fig. 2.18). Therefore, SDL is based on
asynchronous message passing.




Figure 2.18. SDL interprocess communication 



Each process is assumed to fetch the next available entry from the FIFO queue
and check whether it matches one of the inputs described for the current state.
If it does, the corresponding state transition takes place and output is
generated. The entry from the FIFO-queue is ignored if it does not match any
of the listed inputs (unless the so-called SAVE-mechanism is used).
FIFO-queues are conceptually thought of as being of infinite length. This
means: in the description of the semantics of SDL models, FIFO-overflow is
never considered. In actual systems, however, FIFO-queues must be of finite
length. This is one of the problems of SDL: in order to derive realizations
from specifications, safe upper bounds on the length of the FIFO-queues must
be proven.

Process interaction diagrams can be used for visualizing which of the
processes are communicating with each other. Process interaction diagrams
include channels used for sending and receiving signals. In the case of SDL,
the term “signal” denotes inputs and outputs of modeled automata. Process
interaction diagrams are special cases of block diagrams (see below).
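
The queue semantics of a single SDL process, as described above, can be
mimicked in a few lines of Python. This is a sketch under our own
assumptions, not SDL syntax: the class SdlProcess, its transition table
format and the signal names in the usage note are hypothetical, and the
SAVE-mechanism is not modeled.

```python
from collections import deque

class SdlProcess:
    """Extended FSM with an input FIFO queue (illustrative only)."""

    def __init__(self, initial_state, transitions):
        # transitions: {(state, input_signal): (next_state, output_signal)}
        self.state = initial_state
        self.transitions = transitions
        self.fifo = deque()          # conceptually unbounded, as in SDL
        self.outputs = []

    def send(self, signal):
        self.fifo.append(signal)     # asynchronous: the sender never blocks

    def step(self):
        """Fetch one signal; take the transition if it matches, else discard."""
        if not self.fifo:
            return False
        signal = self.fifo.popleft()
        key = (self.state, signal)
        if key in self.transitions:
            self.state, out = self.transitions[key]
            if out is not None:
                self.outputs.append(out)
        return True
```

For example, with transitions {("A", "g"): ("B", "w"), ("B", "h"): ("A", None)}
and the signals h, g, h sent in this order, the first h is discarded in
state A, g leads to state B with output w, and the second h leads back to A.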

Example: Fig. 2.19 shows a process interaction diagram B1 with channels Sw1 
and Sw2. Brackets include the names of signals propagated along a certain 
channel. 




Figure 2.19. Process interaction diagram 



There are three ways of indicating the recipient of signals: 



1 Through process identifiers: by using identifiers of recipient processes in
the graphical output symbol (see fig. 2.20 (left)).

Actually, the number of processes does not even need to be fixed at compile
time, since processes can be generated dynamically at run-time. OFFSPRING
represents identifiers of child processes generated dynamically by
a process.

Figure 2.20. Describing signal recipients (left: Counter TO OFFSPRING;
right: Counter VIA Sw1)

2 Explicitly: by indicating the channel name (see fig. 2.20 (right)). Sw1 is 
the name of a channel. 

3 Implicitly: if signal names imply the channel names, those channels are
used. Example: for fig. 2.19, signal B will implicitly always be
communicated via channel Sw1.



No process can be defined within any other (processes cannot be nested).
However, they can be grouped hierarchically into so-called blocks. Blocks at
the highest hierarchy level are called systems, blocks at the lowest hierarchy
level are called process interaction diagrams. B1 can be used within
intermediate level blocks (such as within B in fig. 2.21).



Figure 2.21. SDL block

At the highest level in the hierarchy, we have the system (see fig. 2.22). A
system will not have any channels at its boundary if the environment is also
modeled as a block.




Figure 2.22. SDL system 

Fig. 2.23 shows the hierarchy modeled by block diagrams 2.19, 2.21 and 2.22.
Process interaction diagrams are next to the leaves of the hierarchical
description, system descriptions are at its root. Some of the restrictions of
modeling hierarchy are removed in version SDL-2000 of the language. With
SDL-2000, the descriptive power of blocks and processes is harmonized and
replaced by a general agent concept.




Figure 2.23. SDL hierarchy 

In order to support the modeling of time, SDL includes timers. Timers can
be declared locally for processes. They can be set and reset using SET and
RESET primitives, respectively. Fig. 2.24 shows the use of a timer T. The
diagram corresponds to that of fig. 2.16, with the exception that timer T is
set to the current time plus p during the transition from state D to E. For
the transition from E to A we now have a timeout of p time units. If these
time units have elapsed before signal f has arrived, a transition to state A
is taken without generating output signal v.

Figure 2.24. Using timer T

SDL can be used, for example, to describe protocol stacks found in computer
networks. Fig. 2.25 shows three processors connected through a router.
Communication between processors and the router is based on FIFOs.




Figure 2.25. Small computer network described in SDL 

The processors as well as the router implement layered protocols (see fig. 
2.26). 

Each layer describes communication at a more abstract level. The behavior of
each layer is typically modeled as a finite state machine. The detailed
description of these FSMs depends on the network protocol and can be quite
complex. Typically, this behavior includes checking and handling error
conditions, and sorting and forwarding of information packages.

Figure 2.26. Protocol stacks represented in SDL

Currently (2003) available tools for SDL include interfaces to UML (see page
45), MSCs (see page 44), and CHILL (see page 78) from companies such as
Telelogic [Telelogic AB, 2003], Cinderella [Cinderella ApS, 2003] and
SINTEF. A comprehensive list of tools is available from the SDL forum [SDL
Forum Society, 2003b].

SDL is excellent for distributed applications and was used, for example, for
specifying ISDN. Commercial tools for SDL are available (see, for example,
http://www.telelogic.com). SDL is not necessarily deterministic (the order in
which signals arriving at some FIFO at the same time are processed is not
specified). Reliable implementations require the knowledge of an upper bound
on the length of the FIFOs. This upper bound may be difficult to compute. The
timer concept is sufficient for soft deadlines, but not for hard ones.
Hierarchies are not supported in the same way as in StateCharts. There is no
full programming support (but recent revisions of the standard have started
to change this) and no description of non-functional properties.

2.6 Petri nets 
2.6.1 Introduction 

In 1962, Carl Adam Petri published his method for modeling causal dependen- 
cies, which became known as Petri nets. The key strength of Petri nets is this 
focus on causal dependencies. Petri nets do not assume any global synchro- 
nization and are therefore especially suited for modeling distributed systems. 

Conditions, events and a flow relation are the key elements of Petri nets. 
Conditions are either satisfied or not satisfied. Events can happen. The flow 
relation describes the conditions that must be met before events can happen 
and it also describes the conditions that become true if events happen. 

Graphical notations for Petri nets typically use circles to denote conditions
and boxes to denote events. Arrows represent flow relations. Fig. 2.27 shows a
first example.

Figure 2.27. Single track railroad segment (labels: train entering track from
the left, train wanting to go right, train going to the right, train leaving
track to the right, train going to the left)



This example describes mutual exclusion for trains at a railroad track that
must be used in both directions. A token is used to prevent collisions of
trains going in opposite directions. In the Petri net representation, that
token is symbolized by a condition in the center of the model. A filled circle
denotes the situation in which the condition is met (this means: the track is
available). When a train wants to go to the right (also denoted by a filled
circle in fig. 2.27), the two conditions that are necessary for the event
“train entering track from the left” are met. We call these two conditions
preconditions. If the preconditions of an event are met, it can happen. As a
result of that event happening, the token is no longer available and there is
no train waiting to enter the track. Hence, the preconditions are no longer
met and the filled circles disappear (see fig. 2.28).



Figure 2.28. Using resource “track”

However, there is now a train going on that track from the left to the right
and thus the corresponding condition is met (see fig. 2.28). A condition which
is met after an event happened is called a postcondition. In general, an event
can happen only if all its preconditions are true (or met). If it happens, the
preconditions are no longer met and the postconditions become valid. Arrows
identify those conditions which are preconditions of an event and those that
are postconditions of an event. Continuing with our example, we see that a
train leaving the track will return the token to the condition at the center
of the model (see fig. 2.29).



Figure 2.29. Freeing resource “track”

If there are two trains competing for the single-track segment (see fig. 2.30), 
only one of them can enter. 
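
The firing rule used in this example can be written down in a few lines of
Python. The sketch follows the informal rule given above (an event can happen
if all its preconditions are met); the condition and event names are our own
labels for the railroad example, not part of the figures.

```python
def enabled(marking, pre):
    # An event can happen only if all its preconditions are met.
    return all(marking[c] for c in pre)

def fire(marking, pre, post):
    # Firing clears the preconditions and sets the postconditions.
    if not enabled(marking, pre):
        raise ValueError("event is not enabled")
    new = dict(marking)
    for c in pre:
        new[c] = False
    for c in post:
        new[c] = True
    return new

# Each event is a pair (preconditions, postconditions); names are illustrative.
ENTER_LEFT = ({"wants_right", "track_free"}, {"going_right"})
ENTER_RIGHT = ({"wants_left", "track_free"}, {"going_left"})
LEAVE_RIGHT = ({"going_right"}, {"track_free"})

start = {"wants_right": True, "wants_left": True, "track_free": True,
         "going_right": False, "going_left": False}
```

Starting from start, both entering events are enabled. Once fire(start,
*ENTER_LEFT) has consumed the token track_free, ENTER_RIGHT is no longer
enabled until LEAVE_RIGHT returns the token.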



Figure 2.30. Conflict for resource “track”

Let us now consider a larger example: 

We are again considering the synchronization of trains. In particular, we are 
trying to model high-speed Thalys trains traveling between Amsterdam, Co- 
logne, Brussels and Paris. Segments of the train run independently from Ams- 
terdam and Cologne to Brussels. There, the segments get connected and then 
they run to Paris. On the way back from Paris, they get disconnected at Brus- 
sels again. We assume that Thalys trains have to synchronize with some other 
train at Paris. The corresponding Petri net is shown in fig. 2.31.

Figure 2.31. Model of Thalys trains running between Amsterdam, Cologne,
Brussels, and Paris



Places 3 and 10 model trains waiting at Amsterdam and Cologne, respectively. 
Transitions 9 and 2 model trains driving from these cities to Brussels. After 
their arrival at Brussels, places 9 and 2 contain tokens. Transition 1 denotes 
connecting the two trains. The cup symbolizes the driver of one of the trains, 
who will have a break at Brussels while the other driver is continuing on to 
Paris. Transition 5 models synchronization with other trains at the Gare du 
Nord station of Paris. These other trains connect Gare du Nord with some other 
station (we have used Gare de Lyon as an example, even though the situation 
at Paris is somewhat more complex). Of course, Thalys trains do not use steam 
engines; they are just easier to visualize than modern high speed trains. 

A key advantage of Petri nets is that they can be the basis for formal proofs 
about system properties and that there are standardized ways of generating 
such proofs. In order to enable such proofs, we need a more formal definition 
of Petri nets.

2.6.2 Condition/event nets

Condition/event nets are the first class of Petri nets that we will define
more formally.

Definition: $N = (C, E, F)$ is called a net, iff the following holds:

1 $C$ and $E$ are disjoint sets.

2 $F \subseteq (E \times C) \cup (C \times E)$ is a binary relation, called
flow relation.

The set $C$ is called conditions and the set $E$ is called events.

Def.: Let $N$ be a net and let $x \in (C \cup E)$.
${}^\bullet x := \{y \mid y F x\}$ is called the set of preconditions of $x$
and $x^\bullet := \{y \mid x F y\}$ is called the set of postconditions of $x$.

This definition is mostly used for the case of $x \in E$, but it applies also
to the case of $x \in C$.

Def.: Let $(c, e) \in C \times E$.

1 $(c, e)$ is called a loop, if $c F e \land e F c$.

2 $N$ is called pure, if $F$ does not contain any loops (see fig. 2.32, left).

Def.: A net is called simple, if no two transitions $t_1$ and $t_2$ have the
same set of pre- and postconditions.
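
These definitions translate almost literally into set operations. The
following Python sketch represents the flow relation F as a set of pairs;
the function names are our own choice.

```python
def preset(F, x):
    """Preconditions of x: all y with (y, x) in the flow relation F."""
    return {y for (y, z) in F if z == x}

def postset(F, x):
    """Postconditions of x: all y with (x, y) in F."""
    return {y for (z, y) in F if z == x}

def is_pure(C, E, F):
    """Pure: no pair (c, e) with both (c, e) and (e, c) in F."""
    return not any((c, e) in F and (e, c) in F for c in C for e in E)

def is_simple(E, F):
    """Simple: no two events share both their preset and their postset."""
    seen = set()
    for e in E:
        key = (frozenset(preset(F, e)), frozenset(postset(F, e)))
        if key in seen:
            return False
        seen.add(key)
    return True
```

For instance, a net in which two events e1 and e2 both lead from condition c1
to condition c2 is pure, but not simple.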




Figure 2.32. Nets which are not pure (left) and not simple (right) 

Simple nets with no isolated elements meeting some additional restrictions are 
called condition/event nets. Condition/event nets are a special case of bipar- 
tite graphs (graphs with two disjoint sets of nodes). We will not discuss those 
additional restrictions in detail since we will consider more general classes of 
nets in the following. 

2.6.3 Place/transition nets 

For condition/event nets, there is at most one token per condition. For many
applications, it is useful to remove this restriction and to allow more tokens
per condition. Nets allowing more than one token per condition are called
place/transition nets. Places correspond to what we so far called conditions
and transitions correspond to what we so far called events. The number of
tokens per place is called a marking. Mathematically, a marking is a mapping
from the set of places to the set of natural numbers extended by a special
symbol $\omega$ denoting infinity.

Let $\mathbb{N}_0$ denote the natural numbers including 0. Then, formally
speaking, place/transition nets can be defined as follows:

Def.: $(P, T, F, K, W, M_0)$ is called a place/transition net
$\Longleftrightarrow$

1 $N = (P, T, F)$ is a net with places $p \in P$ and transitions $t \in T$.

2 Mapping $K : P \to (\mathbb{N}_0 \cup \{\omega\}) \setminus \{0\}$ denotes
the capacity of places ($\omega$ symbolizes infinite capacity).

3 Mapping $W : F \to (\mathbb{N}_0 \setminus \{0\})$ denotes the weight of
graph edges.

4 Mapping $M_0 : P \to \mathbb{N}_0 \cup \{\omega\}$ represents the initial
marking of places.

Edge weights affect the number of tokens that are required before transitions 
can happen and also identify the number of tokens that are generated if a certain 
transition takes place. Let $M(p)$ denote a current marking of place
$p \in P$ and let $M'(p)$ denote a marking after some transition $t \in T$
took place. The weight of edges belonging to preconditions represents the
number of tokens that are removed from places in the precondition set.
Accordingly, the weight of edges belonging to the postconditions represents
the number of tokens that are added to the places in the postcondition set.
Formally, marking $M'$ is computed as follows:

$$
M'(p) = \begin{cases}
M(p) - W(p,t), & \text{if } p \in {}^\bullet t \setminus t^\bullet \\
M(p) + W(t,p), & \text{if } p \in t^\bullet \setminus {}^\bullet t \\
M(p) - W(p,t) + W(t,p), & \text{if } p \in {}^\bullet t \cap t^\bullet \\
M(p) & \text{otherwise}
\end{cases}
$$

Fig. 2.33 shows an example of how transition $t_j$ affects the current marking.





Figure 2.33. Generation of a new marking



By default, unlabeled edges are considered to have a weight of 1 and unlabeled
places are considered to have unlimited capacity $\omega$.

We now need to explain the two conditions that must be met before a transition
$t \in T$ can take place:

■ for all places p in the precondition set, the number of tokens must at least
be equal to the weight of the edge from p to t and

■ for all places p in the postcondition set, the capacity must be large enough 
to accommodate the new tokens which t will generate. 

Transitions meeting these two conditions are called M-activated. Formally, 
this can be defined as follows: 

Def.: Transition $t \in T$ is said to be M-activated $\Longleftrightarrow$

$$(\forall p \in {}^\bullet t : M(p) \ge W(p,t)) \land (\forall p \in t^\bullet : M(p) + W(t,p) \le K(p))$$

Activated transitions can happen, but they do not need to. If several transi- 
tions are activated, the sequence in which they happen is not deterministically 
defined. 
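
The activation condition and the computation of the new marking can be
sketched in Python as follows. The data layout is an assumption made for
this illustration: pre[t] and post[t] map the places in the pre- and
postcondition set of t to the edge weights W(p,t) and W(t,p), and places
missing from K have unlimited capacity.

```python
import math

def is_activated(t, M, pre, post, K):
    """M-activation of t, following the definition above."""
    enough_tokens = all(M[p] >= w for p, w in pre[t].items())
    enough_room = all(M[p] + w <= K.get(p, math.inf)
                      for p, w in post[t].items())
    return enough_tokens and enough_room

def fire(t, M, pre, post, K):
    """Return the successor marking M' after firing t."""
    if not is_activated(t, M, pre, post, K):
        raise ValueError(f"transition {t!r} is not M-activated")
    M_new = dict(M)
    for p, w in pre[t].items():
        M_new[p] -= w                # remove W(p, t) tokens
    for p, w in post[t].items():
        M_new[p] += w                # add W(t, p) tokens
    return M_new
```

Applying both loops covers all four cases of the definition of M': a place
that is in both sets loses W(p,t) and gains W(t,p) tokens, and a place in
neither set is left unchanged.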

For place/transition nets, there are standard techniques for generating proofs
of system properties. For example, there may be subsets of places for which
the total number of tokens does not change, no matter which transition fires
[Reisig, 1985]. Such subsets of places are called place invariants. For
example, the number of trains commuting between Cologne and Paris does not
change in our railway example. The same is true for the trains traveling
between Amsterdam and Paris. Computing such invariants can be the starting
point for verifying required system properties such as mutual exclusion.
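
Whether a given subset of places is a place invariant in the sense described
above can be checked structurally: every transition must remove as many
tokens from the subset as it adds to it. The following Python sketch assumes
that pre[t] and post[t] map places to edge weights; this layout and the
function name are our own, and the sketch does not implement the general
computation of invariants from [Reisig, 1985].

```python
def is_place_invariant(places, pre, post):
    """True if no transition changes the total token count on `places`."""
    for t in set(pre) | set(post):
        removed = sum(w for p, w in pre.get(t, {}).items() if p in places)
        added = sum(w for p, w in post.get(t, {}).items() if p in places)
        if removed != added:
            return False
    return True
```

For a net in which a transition enter moves a token from places waiting and
track to place on_track, and a transition leave moves it from on_track back
to track, the subset {track, on_track} is such an invariant. This expresses
mutual exclusion if the subset initially holds a single token.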

2.6.4 Predicate/transition nets 

Condition/event nets as well as place/transition nets can quickly become very
large for large examples. A reduction of the size of the nets is frequently
possible with predicate/transition nets. We will demonstrate this, using the
so-called “dining philosophers problem” as an example. The problem is based on
the assumption that a set of philosophers is dining at a round table. In front
of each philosopher, there is a plate containing spaghetti. Between each of
the plates, there is just one fork (see fig. 2.34). Each philosopher is either
eating or thinking. Eating philosophers need their two adjacent forks for
that, so they can only eat if their neighbors are not eating.

This situation can be modeled as a condition/event net, as shown in fig. 2.35.
Conditions $t_j$ correspond to the thinking states, conditions $e_j$
correspond to the eating states, and conditions $f_j$ represent available
forks.

Figure 2.34. The dining philosophers problem




Figure 2.35. Place/transition net model of the dining philosophers problem 



Considering the small size of the problem, this net is already very large. The 
size of this net can be reduced by using predicate/transition nets. Fig. 2.36 is a 
model of the same problem as a predicate/transition net. 

Figure 2.36. Predicate/transition net model of the dining philosophers problem

With predicate/transition nets, tokens have an identity and can be
distinguished. We use this in fig. 2.36 in order to distinguish between the
three different philosophers $p_1$ to $p_3$ and to identify fork $f_3$.
Furthermore, edges can be labeled with variables and functions. In the
example, we use variables to represent the identity of philosophers and
functions l(x) and r(x) to denote the left and right forks of philosopher x,
respectively. These two forks are required as a precondition for transition u
and returned as a postcondition by transition v. Note that this model can be
easily extended to the case of n > 3 philosophers. We just need to add more
tokens. In contrast to the net in fig. 2.35, the structure of the
net does not have to be changed.
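
The idea of tokens with an identity can be sketched in Python with sets of
distinguishable tokens. The numbering of philosophers and forks from 0, the
concrete definitions of l and r, and the function names are assumptions made
for this illustration; the transitions u and v are modeled by start_eating
and stop_eating.

```python
N = 3                                   # number of philosophers (and forks)

def l(x):
    return x                            # left fork of philosopher x (assumed numbering)

def r(x):
    return (x + 1) % N                  # right fork of philosopher x

def start_eating(state, x):
    """Transition u: philosopher x takes forks l(x) and r(x)."""
    thinking, eating, forks = state
    if x not in thinking or not {l(x), r(x)} <= forks:
        raise ValueError(f"u is not enabled for philosopher {x}")
    return thinking - {x}, eating | {x}, forks - {l(x), r(x)}

def stop_eating(state, x):
    """Transition v: philosopher x returns both forks."""
    thinking, eating, forks = state
    if x not in eating:
        raise ValueError(f"v is not enabled for philosopher {x}")
    return thinking | {x}, eating - {x}, forks | {l(x), r(x)}

# Marking: tokens in the places "thinking", "eating" and "forks".
initial = (frozenset(range(N)), frozenset(), frozenset(range(N)))
```

Changing N adds tokens to the initial marking, while the two transition
functions, which correspond to the structure of the net, stay the same.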

2.6.5 Evaluation 

The key advantage of Petri nets is their power for modeling causal depen- 
dencies. Standard Petri nets have no notion of time and all decisions can be 
taken locally, by just analyzing transitions and their pre- and postconditions. 
Therefore, they can be used for modeling geographically distributed systems. 
Furthermore, there is a strong theoretical foundation for Petri nets, simplifying 
formal proofs of systems properties. 

In certain contexts, their strength is also their weakness. If time is to be
explicitly modeled, standard Petri nets cannot be used. Furthermore, standard
Petri nets have no notion of hierarchy and no programming language elements,
let alone object oriented features. In general, it is difficult to represent
data.

There are extended versions of Petri nets avoiding the mentioned weaknesses.
However, there is no universal extended version of Petri nets meeting all
requirements mentioned at the beginning of this chapter. Nevertheless, due to
the increasing amount of distributed computing, Petri nets have become more
popular than they were initially.

2.7 Message Sequence Charts 

Message sequence charts (MSCs) provide a graphical means for representing
schedules. MSCs use one dimension (typically the vertical dimension) for
representing time, and the other dimension for representing geographical
distribution.

MSCs provide the right means for visualizing schedules of trains or buses.
Fig. 2.37 is an example. This example also refers to trains between
Amsterdam, Cologne, Brussels and Paris. Aachen is included as an intermediate
stop between Cologne and Brussels. Vertical segments correspond to times spent
at stations. For one of the trains, there is a timing overlap between the
trains coming from Cologne and Amsterdam at Brussels. There is a second train
which travels between Paris and Cologne which is not related to an Amsterdam
train.




Figure 2.37. Message sequence diagram 

A more realistic example is shown in fig. 2.38. This example [Huerlimann,
2003] describes simulated Swiss railway traffic in the Lötschberg area. Slow
and fast trains can be distinguished by their slope in the graph. The figure
includes information about the time of the day. In this context, the diagram
is called time/distance diagram.

MSCs are appropriate means for representing typical schedules. However, they
fail to provide information about necessary synchronization. For example, in
the presented example it is not known whether the timing overlap at Brussels
happens coincidentally or whether some real synchronization for connecting
trains is required. Furthermore, permissible deviations from the presented
schedule (min/max timing behavior) can hardly be included in these charts.

2.8 UML 

All the languages presented so far require a rather precise knowledge about
the behavior of the system to be specified. Frequently, and especially during
the early specification phases, such knowledge is not available. Very first
ideas about systems are frequently sketched on “napkins” or “envelopes”.
Support for a more systematic approach to these first phases in a design
process is the goal of the major so-called UML standardization effort. UML
[OMG, 2005], [Fowler and Scott, 1998] stands for “Unified Modeling Language”.
UML was designed by leading software technology experts and is supported by
commercial tools. UML primarily aims at the support of the software design
process. UML contains a large number of diagram types and it is, by itself, a
complex graphical language. Fortunately, most of the diagram types are
variants of those graphical languages which we have already introduced in
this book.

Figure 2.38. Railway traffic displayed by a message sequence diagram (courtesy
H. Brandli, IVT, ETH Zurich), ©ETH Zurich

Version 1.4 of UML was not designed for embedded systems. Therefore, it
lacks a number of features required for modeling embedded systems (see page 
13). In particular, the following features are missing [McLaughlin and Moore, 
1998]: 

■ the partitioning of software into tasks and processes cannot be modeled, 

■ timing behavior cannot be described at all, 

■ the presence of essential hardware components cannot be described. 

Due to the increasing amount of software in embedded systems, UML is gaining
importance for embedded systems as well. Hence, several proposals for
UML extensions to support real-time applications have been made [McLaughlin
and Moore, 1998], [Douglass, 2000]. These extensions have been considered
during the design of UML 2.0. UML 2.0 includes 13 diagram types (up
from nine in UML 1.4) [Ambler, 2005]:

■ Sequence diagrams: Sequence diagrams are variants of message sequence 
charts. Fig. 2.39 shows an example (based on an example from Gentleware 
AG [Poseidon, 2003]). 



Figure 2.39. Segment from a UML sequence diagram (lifelines: Client, website;
messages: go to Catalog:navigate, displayCatalog:webpage, selectproduct:Nav,
requestlogin:webpage)

One of the key distinctions between the types of diagrams shown in figs.
2.38 and 2.39 is that fig. 2.39 does not include any reference to real time.
UML version 1.4 was not designed for real-time applications. Some of the
restrictions of UML 1.4 have been removed in UML 2.0.

■ State machine diagrams (called State Diagrams in version 1 of UML): 
UML includes a variation of StateCharts and hence allows modeling state 
machines. 

■ Activity diagrams: In essence, activity diagrams are extended Petri nets. 
Extensions include symbols denoting decisions (just like in ordinary flow 
charts). The placement of symbols is somewhat similar to SDL. Fig. 2.40 
shows an example. 

The example shows the procedure to be followed during a standardization 
process. Forks and joins of control correspond to transitions in Petri nets 
and they use the symbols (horizontal bars) that were initially used for Petri 
nets as well. The diamond at the bottom shows the symbol used for deci- 
sions. Activities can be organized into “swim-lanes” (areas between verti- 
cal dotted lines) such that the different responsibilities and the documents 
exchanged can be visualized. 

■ Deployment diagram: These diagrams are important for embedded sys- 
tems: they describe the “execution architecture” of systems (hardware or 
software nodes).

Figure 2.40. Activity diagram [Kobryn, 2001]



■ Package diagrams: Package diagrams represent the partitioning of soft- 
ware into software packages. They are similar to module charts in State- 
Mate. 

■ Use case diagrams: These diagrams capture typical application scenarios 
of the system to be designed. For example, fig. 2.41 [Ambler, 2005] shows 
scenarios for customers of some bank. 

■ Class diagrams: These diagrams describe inheritance relations of object 
classes.

Figure 2.41. Use case example (actor: customer; use cases: open account,
deposit funds, withdraw funds, close account)



■ Timing diagrams: They can be used to show the change of the state of an
object over time. 



■ Communication diagram (called Collaboration diagrams in UML 1.x): 
These graphs represent classes, relations between classes, and messages 
that are exchanged between them. 



■ Component diagrams: They represent the components used in applica- 
tions or systems. 



■ Object diagrams, interaction overview diagrams, composite structure 
diagrams: This list consists of three types of diagrams which are less fre- 
quently used. Some of them may actually be special cases of other types of 
diagrams. 



Currently available tools, for example from ilogix (see http://www.ilogix.com), 
provide some consistency checking between the different diagram types. Com- 
plete checking, however, seems to be impossible. One reason for this is that 
the semantics of UML initially was left undefined. It has been argued that 
this was done intentionally, since one does not like to bother about the precise 
semantics during the early phases of the design. As a consequence, precise, 
executable specifications can only be obtained if UML is combined with some 
other, executable language. Available design tools have combined UML with 
SDL [Telelogic, 1999] and C++. There are, however, also some first attempts
to define the semantics of UML. 

In this book, we will not discuss UML in further detail, since all the relevant di- 
agram types have already been described. Nevertheless, it is interesting to note 
how a technique like Petri nets was initially certainly not a mainstream tech- 
nique. Decades after its invention, it has become a frequently applied technique 
due to its inclusion in UML. 
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2.9 Process networks 
2.9.1 Task graphs 

Process networks have already been mentioned in the context of computa- 
tional models. Process networks are modeled with graphs. We will use the 
names task graphs and process networks interchangeably, even though these 
terms were created by different communities. Nodes in the graph represent 
processes performing operations. Processes map input data streams to output 
data streams. Processes are often implemented in high-level programming lan- 
guages. Typical processes contain (possibly non-terminating) iterations. In 
each cycle of the iteration, they consume data from their inputs, process the
data received, and generate data on their output streams. Edges represent rela- 
tions between processes. We will now introduce these graphs at a more detailed 
level. 

The most obvious relation between processes is their causal dependence: Many 
processes can only be executed after other processes have terminated. This 
dependence is typically captured in dependence graphs. Fig. 2.42 shows a 
dependence graph for a set of processes or tasks. 




Figure 2.42. Dependence graph

Def.: A dependence graph is a directed graph $G = (V, E)$ in which $E \subseteq V \times V$
is a partial order. If $(v_1, v_2) \in E$, then $v_1$ is called an immediate predecessor of
$v_2$ and $v_2$ is called an immediate successor of $v_1$. Suppose $E^*$ is the transitive
closure of $E$. If $(v_1, v_2) \in E^*$, then $v_1$ is called a predecessor of $v_2$ and $v_2$ is
called a successor of $v_1$.
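
The definition above can be made concrete with a small sketch. The following Python fragment is only an illustration under our own assumptions: the edge set E is a hypothetical example (it is not the graph of fig. 2.42), and the function names are chosen for this sketch. It computes the transitive closure of E and derives immediate predecessors and predecessors from it.

```python
from itertools import product


def transitive_closure(edges):
    """Return E*, the transitive closure of the edge set E (a set of (u, v) pairs)."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot, since the set grows while we search for new pairs.
        for (a, b), (c, d) in product(tuple(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def immediate_predecessors(edges, v):
    """All u with (u, v) in E."""
    return {u for (u, w) in edges if w == v}


def predecessors(edges, v):
    """All u with (u, v) in E*."""
    return {u for (u, w) in transitive_closure(edges) if w == v}


# Hypothetical example graph: T1 precedes T2 and T3, which both precede T4.
E = {("T1", "T2"), ("T1", "T3"), ("T2", "T4"), ("T3", "T4")}
```

In this example, T2 and T3 are the immediate predecessors of T4, whereas T1 is a predecessor of T4 only through the transitive closure.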

Such dependence graphs form a special case of task graphs. Task graphs 
represent relations between a set of processes. Task graphs may contain more 
information than modeled in the dependence graph in fig. 2.42. For example, 
task graphs may include the following extensions of dependence graphs: 

1 Timing information: Tasks may have arrival times, deadlines, periods, and 
execution times. In order to take these into account while scheduling tasks, 
it may be useful to include this information in the task graphs. Adopting the 
notation used in the book by Liu [Liu, 2000], we include possible execution
intervals in fig. 2.43. Tasks $T_1$ to $T_3$ are assumed to be independent.

Figure 2.43. Task graphs including timing information



Significantly more complex combinations of timing and dependence rela- 
tions can exist. 

2 Distinction between different types of relations between tasks: Prece- 
dence relations just model constraints for possible execution sequences. At 
a more detailed level, it may be useful to distinguish between constraints for 
scheduling and communication between tasks. Communication can again 
be described by edges, but additional information may be available for each 
of the edges, such as the time of the communication and the amount of in- 
formation exchanged. Precedence edges may be kept as a separate type of 
edges, since there could be situations in which processes have to execute 
sequentially even though they do not exchange information. 

In fig. 2.42, input and output (I/O) is not explicitly described. Implicitly 
it is assumed that tasks without any predecessor in the task graph might 
be receiving input at some time. Also, it is assumed that they generate 
output for the successor task and that this output is available only after 
the task has terminated. It is often useful to describe input and output more 
explicitly. In order to do this, another kind of relation is required. Using the 
same symbols as Thoen [Thoen and Catthoor, 2000], we use partially filled
circles for denoting input and output. In fig. 2.44, partially filled circles identify I/O
edges.




Figure 2.44. Task graphs including I/O-nodes and edges



3 Exclusive access to resources: Tasks may be requesting exclusive access 
to some resource, for example to some input/output device or some com- 
munication area in memory. Information about necessary exclusive access 
should be taken into account during scheduling. Exploiting this information
might, for example, help to avoid the priority inversion problem
(see page 141). Information concerning exclusive access to resources can
be included in task graphs.

4 Periodic schedules: Many tasks, especially in digital signal processing, are 
periodic. This means that we have to distinguish more carefully between 
a task and its execution (the latter is frequently called a job [Liu, 2000]). 
Task graphs for such schedules are infinite. Fig. 2.45 shows a task graph 
including jobs $J_{n-1}$ to $J_{n+1}$ of a periodic task.




Figure 2.45. Task graph including jobs of a periodic task 

5 Hierarchical graph nodes: The complexity of the computations denoted
by graph nodes may be quite different. On one hand, specified programs
may be quite large and contain thousands of lines of code. On the other
hand, programs can be split into small pieces of code so that in the extreme
case, each of the nodes corresponds only to a single operation. The level
of complexity of graph nodes is also called their granularity. Which granularity
should be used? There is no universal answer to this. For some
purposes, the granularity should be as large as possible. For example, if we
consider each of the nodes as one process to be scheduled by the RTOS, it
may be wise to work with large nodes in order to minimize context-switches
between different processes. For other purposes, it may be better to work
with nodes modeling just a single operation. For example, nodes will have
to be mapped to hardware or to software. If a certain operation (like the
frequently used Discrete Cosine Transform, or DCT) can be mapped to
special purpose hardware, then it should not be buried in a complex node
that contains many other operations. It should rather be modeled as its own
node. In order to avoid frequent changes of the granularity, hierarchical
graph nodes are very useful. For example, at a high hierarchical level, the
nodes may denote complex tasks, at a lower level basic blocks and at an
even lower level individual arithmetic operations. Fig. 2.46 shows a hierarchical
version of the dependence graph in fig. 2.42, using a rectangle to
denote a hierarchical node.

Figure 2.46. Hierarchical task graph

A very comprehensive task graph model, called multi-thread graph (MTG),
was proposed by Thoen [Thoen and Catthoor, 2000]. MTGs are defined as
follows:

Def.: A multi-thread graph M is defined as an 11-tuple

$(O, E, V, D, \vartheta, \iota, \Lambda, E^{lat}, E^{resp}, \nabla^{i}, \nabla^{av})$ where

■ $O$ is the set of operation nodes. They can be of different types, including
thread, hierarchical thread, or, event, synchro, semaphore, source and sink.
MTGs have single sources and sinks of type source and sink, respectively.
Nodes of type or allow modeling situations in which only one of a set of
tasks is required in order to start the next task. Events model external input.
Semaphores can be used to model mutual exclusion. Synchro nodes provide
acknowledgments to the environment.

■ $E$ is the set of control edges. Attributes of control edges include timing
information like production and consumption rate.

■ $V$, $D$, and $\vartheta$ refer to the access of variables, not discussed in detail in this
text.

■ $\iota$ is the set of input/output nodes.

■ $\Lambda$ associates execution latency intervals with all threads.

■ $E^{lat}$, $E^{resp}$, $\nabla^{i}$ and $\nabla^{av}$ are timing constraints.

As can be seen from the definition, almost all of the presented extensions of 
simple precedence graphs are included in MTGs. MTGs are used for the work 
described starting at page 191. 

2.9.2 Asynchronous message passing 

For asynchronous message passing, communication between processes is buff- 
ered. Typically, buffers are assumed to be FIFOs of theoretically unbounded 
length. 

2.9.2.1 Kahn process networks

Kahn process networks (KPN) [Kahn, 1974] are a special case of such process 
networks. For KPN, writes are non-blocking, whereas reads block whenever 
an attempt is made to read from an empty FIFO queue. There is no other way 
for communication between processes except through FIFO-queues. Only a
single process is allowed to read from a certain queue. So, if output data has
to be sent to more than a single process, duplication of data must be done
inside processes. In general, Kahn processes require scheduling at run-time,
since it is difficult to predict their precise behavior over time. The question of
whether or not finite-length FIFOs are sufficient for an actual KPN model
is undecidable in the general case. Nevertheless, practically useful algorithms
exist [Kienhuis et al., 2000].
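
The communication rules of KPNs can be sketched with threads and unbounded queues. The following Python fragment is a minimal illustration, not a KPN implementation taken from the literature: the process names, the network topology and the use of None as an end-of-stream marker are assumptions made for this sketch. Writes never block (the queues are unbounded), reads block on an empty queue, each queue has exactly one reader, and fan-out is realized inside a process.

```python
import queue
import threading


def producer(out_q, n):
    """Write the tokens 0 .. n-1 to the output FIFO."""
    for i in range(n):
        out_q.put(i)          # never blocks: the queue is unbounded
    out_q.put(None)           # end-of-stream marker (assumption of this sketch)


def duplicate(in_q, out_a, out_b):
    """Only one process may read a queue, so duplication is done inside a process."""
    while True:
        token = in_q.get()    # blocking read on an empty FIFO
        out_a.put(token)
        out_b.put(token)
        if token is None:
            break


def collect(in_q, result):
    """Consume tokens until the end-of-stream marker arrives."""
    while True:
        token = in_q.get()
        if token is None:
            break
        result.append(token)


def run_network(n):
    q_src, q_a, q_b = queue.Queue(), queue.Queue(), queue.Queue()
    res_a, res_b = [], []
    threads = [
        threading.Thread(target=producer, args=(q_src, n)),
        threading.Thread(target=duplicate, args=(q_src, q_a, q_b)),
        threading.Thread(target=collect, args=(q_a, res_a)),
        threading.Thread(target=collect, args=(q_b, res_b)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return res_a, res_b
```

The order in which the threads are scheduled does not influence the token sequences seen by the two consumers, which reflects the fact that communication takes place exclusively through FIFOs with blocking reads.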

2.9.2.2 Synchronous data flow

The synchronous data flow (SDF) model [Lee and Messerschmitt, 1987] can 
best be introduced by referring to its graphical notation. Fig. 2.47 (left) shows 
a synchronous data flow graph. The graph is directed; nodes A and
B denote the computations * and +. SDF graphs, like all data flow graphs, show
computations to be performed and their dependence, but not the order in which 
the computations have to be performed (in contrast to specifications in sequen- 
tial languages such as C). Inputs to SDF graphs are assumed to consist of an 
infinite stream of samples. Nodes can start their computations when their in- 
puts are available. Edges must be used whenever there is a data dependency 
between any two nodes. 




Figure 2.47. Graphical representations of synchronous data flow

For each execution, the computation in a node is called a firing. For each 
firing, a number of tokens, representing data, is consumed and produced. In 
synchronous data flow, the number of tokens produced or consumed in one 
firing is constant. Constant edge labels denote the corresponding numbers of 
tokens. These constants facilitate the modeling of multi-rate signal processing 
applications (applications for which certain signals are generated at frequen- 
cies that are multiples of other frequencies). The term synchronous data flow 
reflects the fact that tokens are consumed from the incoming arcs in a syn- 
chronous manner (all at the same instant in time). The term asynchronous 
message passing reflects the fact that tokens can be buffered using FIFOs. The 
property of producing and consuming a constant number of tokens makes it 
possible to determine execution order and memory requirements at compile 
time. Hence, complex run-time scheduling of executions is avoided. SDF 
graphs may include delays, denoted by the symbol D on an edge (see fig. 2.47
(right)). SDF graphs can be translated into periodic schedules for mono- as
well as for multi-processor systems (see e.g. [Pino and Lee, 1995]). A legal
schedule for the simple example of fig. 2.47 would consist of the sequence (A,
B) (repeated forever). A sequence (A, A, B) (A executed twice as many times
as B) would be illegal, since it would accumulate an infinite number of tokens
on the implicit FIFO buffer between A and B.
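
The difference between the legal and the illegal schedule can be checked by counting tokens on the buffer between A and B. The following Python sketch assumes that A produces one token per firing and that B consumes one token per firing; these rates are an assumption consistent with the schedules discussed above, and the function name simulate is hypothetical.

```python
def simulate(schedule, rates=(1, 1), initial_tokens=0, periods=1):
    """Execute `schedule` `periods` times; return the buffer fill level after each period.

    rates = (tokens produced by A per firing, tokens consumed by B per firing).
    """
    produce, consume = rates
    tokens = initial_tokens
    history = []
    for _ in range(periods):
        for node in schedule:
            if node == "A":
                tokens += produce
            elif node == "B":
                if tokens < consume:
                    raise RuntimeError("B fired without sufficient input tokens")
                tokens -= consume
            else:
                raise ValueError(f"unknown node {node!r}")
        history.append(tokens)
    return history


legal = simulate(("A", "B"), periods=4)         # buffer is empty after every period
illegal = simulate(("A", "A", "B"), periods=4)  # buffer grows by one token per period
```

For the schedule (A, B), the fill level returns to its initial value after each period, so a bounded buffer is sufficient. For (A, A, B), the fill level increases with every period, which is exactly the unbounded accumulation of tokens described above.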

2.9.3 Synchronous message passing 

2.9.3.1 CSP

CSP [Hoare, 1985] (communicating sequential processes) is one of the first
languages comprising mechanisms for interprocess communication. Commu- 
nication is based on channels. 

Example: 

process A                          process B
  var a ...                          var b ...
  a := 3;
  c!a;  -- output to channel c       c?b;  -- input from channel c
end;                               end;

Both processes will wait for the other process to arrive at the input or out- 
put statement. This form of communication is called rendez-vous concept or 
blocking communication. 
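
The rendez-vous can be imitated in a general-purpose language. The following Python sketch is an approximation under our own assumptions: the class Channel and its methods send and receive are hypothetical names, and the sketch supports only a single sender and a single receiver per channel. send returns only after the receiver has taken the value, and receive blocks until a sender has arrived.

```python
import queue
import threading


class Channel:
    """Unbuffered channel for one sender and one receiver (sketch only)."""

    def __init__(self):
        self._q = queue.Queue(maxsize=1)

    def send(self, value):
        """Corresponds to c!a: returns only after the value has been received."""
        self._q.put(value)
        self._q.join()            # wait until the receiver has called task_done()

    def receive(self):
        """Corresponds to c?b: blocks until a sender has arrived."""
        value = self._q.get()
        self._q.task_done()       # releases the waiting sender
        return value


def run():
    c = Channel()
    received = []

    def process_a():
        a = 3
        c.send(a)                 # c!a

    def process_b():
        b = c.receive()           # c?b
        received.append(b)

    ta = threading.Thread(target=process_a)
    tb = threading.Thread(target=process_b)
    ta.start()
    tb.start()
    ta.join()
    tb.join()
    return received
```

Whichever thread reaches its communication statement first has to wait for the other one, as in the CSP example.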

CSP has laid the foundation for the OCCAM language that was proposed as a 
programming language of the transputer [Thiebaut, 1995]. 

2.9.3.2 ADA

During the eighties, the Department of Defense (DoD) in the US realized that 
the dependability and maintainability of the software in its military equipment 
could soon become a major source of problems, unless some strict policy was 
enforced. It was decided that all software should be written in the same real- 
time language. Requirements for such a language were formulated. No exist- 
ing language met the requirements and, consequently, the design of a new one 
was started. The language which was finally accepted was based on PASCAL.
It was called ADA (after Ada Lovelace, who can be considered to be the first
(female) programmer). ADA’95 [Kempe, 1995], [Burns and Wellings, 2001]
is an object-oriented extension of the original standard.

One of the interesting features of ADA is the ability to have nested declarations
of processes (called tasks in ADA). Tasks are started whenever control
passes into the scope in which they are declared. The following is an example
(according to Burns et al. [Burns and Wellings, 1990]):

procedure example1 is
  task a;
  task b;

  task body a is
    -- local declarations for a
  begin
    -- statements for a
    null;  -- placeholder so that the skeleton compiles
  end a;

  task body b is
    -- local declarations for b
  begin
    -- statements for b
    null;  -- placeholder so that the skeleton compiles
  end b;
begin
  -- Tasks a and b will start before the first
  -- statement of the body of example1
  null;
end example1;

The communication concept of ADA is another key concept. It is based on 
the rendez-vous paradigm. Whenever two tasks want to exchange informa- 
tion, the task reaching the “meeting point” first has to wait until its partner 
has also reached a corresponding point of control. Syntactically, procedures 
are used for describing communication. Procedures which can be called from 
other tasks have to be identified by the keyword entry. Example [Burns and 
Wellings, 1990]: 

task screen_out is
  entry call (val : character; x, y : integer);
end screen_out;

Task screen_out includes a procedure named call which can be called from 
other processes. Some other task can call this procedure by prefixing it with 
the name of the task: 



screen_out.call(’Z’, 10, 20);

The calling task has to wait until the called task has reached a point of control,
at which it accepts calls from other tasks. This point of control is indicated by
the keyword accept:

task body screen_out is
begin
  accept call (val : character; x, y : integer) do
    null;  -- statements executed during the rendez-vous go here
  end call;
end screen_out;

Obviously, task screen_out may be waiting for several calls at the same time. 
The ADA select-statement provides this capability. Example: 

task screen_output is
  entry call_ch (val : character; x, y : integer);
  entry call_int (z, x, y : integer);
end screen_output;

task body screen_output is
  ...
  select
    accept call_ch ... do ...
    end call_ch;
  or
    accept call_int ... do ...
    end call_int;
  end select;
  ...

In this case, task screen_output will be waiting until either call_ch or call_int is
called.

ADA is the language of choice for almost all military equipment produced in 
the Western hemisphere. 

Again, process networks are not explicitly represented as graphs, but these
graphs can be generated from the textual representation. 
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2.10 Java 

Java was designed as a platform-independent language. It can be executed 
on any machine for which an interpreter of the internal byte-code represen- 
tation of Java-programs is available. This byte-code representation is a very 
compact representation, which requires less memory space than a standard bi- 
nary machine code representation. Obviously, this is a potential advantage in 
system-on-a-chip applications, where memory space is limited. 

Also, Java was designed as a safe language. Many potentially dangerous features
of C or C++ (like pointer arithmetic) are not available in Java. Hence,
Java meets the safety requirements for specification languages for embedded 
systems. Java supports exception handling, simplifying recovery in case of 
run-time errors. There is no danger of memory leakages due to missing mem- 
ory deallocation, since Java provides automatic garbage collection. This fea- 
ture avoids potential problems in applications that have to run for months or 
even years without ever being restarted. Java also meets the requirement to 
support concurrency since it includes threads (light-weight processes). 

In addition, Java applications can be implemented quite fast, since Java sup- 
ports object orientation and since Java development systems come with pow- 
erful libraries. 

However, standard Java is not really designed for real-time systems and a num- 
ber of characteristics which would make it a real-time programming language 
are missing: 



■ The size of Java run-time libraries has to be added to the size of the ap- 
plication itself. These run-time libraries can be quite large. Consequently, 
only really large applications benefit from the compact representation of 
the application itself. 

■ For many embedded applications, direct control over I/O devices is neces- 
sary (see page 16). For safety reasons, no direct control over I/O devices is 
available in standard Java. 

■ Automatic garbage collection requires some computing time. In standard 
Java, the instant in time at which automatic garbage collection is started
cannot be predicted. Hence, the worst case execution time is very difficult 
to predict. Only extremely conservative estimates can be made. 

■ Java does not specify the order in which threads are executed if several 
threads are ready to run. As a result, worst-case execution time estimates 
must be even more conservative. 
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First proposals for solving the problems were made by Nilsen. Proposals 
include hardware-supported garbage-collection, replacement of the run-time 
scheduler and tagging of some of the memory segments [Nilsen, 2004]. 

In 2003, relevant Java programming environments included the Java Enter- 
prise Edition (J2EE), the Java Standard Edition (J2SE), the Java Micro Edition 
(J2ME), and CardJava. CardJava is a stripped-down version of Java with
emphasis on security for SmartCard applications. J2ME is the relevant Java
environment for all other types of embedded systems. Two library profiles have
been defined for J2ME: CDC and CLDC. CLDC is used for mobile phones,
using the so-called MIDP 1.0/2.0 as its standard for the application programming
interface (API). CDC is used, for example, for TV sets and powerful
mobile phones. The currently relevant real-time extension of Java is called 
“Real-time specification for Java (JSR-1)” [Java Community Process, 2002] 
and is supported by TimeSys [TimeSys Inc., 2003]. 

2.11 VHDL 

2.11.1 Introduction 

Languages for describing hardware, such as VHDL, are called hardware description
languages (HDLs). Up to the eighties, most design systems used
graphical HDLs. The most common building block was a gate. However, in
addition to using graphical HDLs, we can also use textual HDLs. The strength
of textual languages is that they can easily represent complex computations
including variables, loops, function parameters and recursion. Accordingly,
when digital systems became more complex in the eighties, textual HDLs almost
completely replaced graphical HDLs. Textual HDLs were initially a research
topic at universities. See Mermet et al. [Mermet et al., 1998] for a
survey of languages designed in Europe in the eighties. MIMOLA was one
of these languages and the author of this book contributed to its design and
applications [Marwedel and Schenk, 1993]. Textual languages became popular
when VHDL and its competitor Verilog (see page 75) were introduced.
VHDL was designed in the context of the VHSIC program of the Department
of Defense (DoD) in the US. VHSIC stands for very high speed integrated circuits^.
Initially, the design of VHDL (VHSIC hardware description language)
was done by three companies: IBM, Intermetrics and Texas Instruments. A
first version of VHDL was published in 1984. Later, VHDL became an IEEE
standard, called IEEE 1076. The first IEEE version was standardized in 1987;
updates were designed in 1992, in 1997 and in 2002.



^The design of the Internet was also part of the VHSIC program. 
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A key distinction between common software languages and hardware descrip- 
tion languages is the need to describe concurrency among different hardware 
components. VHDL uses processes for doing this. Each process models one 
component of the potentially concurrent hardware. For simple hardware com- 
ponents, a single process may be sufficient. More complex components may 
need several processes for modeling their operation. Processes communicate 
through signals. Signals roughly correspond to physical connections (wires). 
Another distinction between software languages and HDLs comes from the 
need to model time. VHDL, like all other HDLs, includes the necessary sup- 
port. 

The design of VHDL used ADA as the starting point, since both languages 
were designed for the DoD. Since ADA is based on PASCAL, VHDL has some 
of the syntactical flavor of PASCAL. However, the syntax of VHDL is much 
more complex and it is necessary not to get distracted by the syntax. In the 
current book, we will just focus on some concepts of VHDL which are useful 
also in other languages. A full description of VHDL is beyond the scope of 
this book. The entire standard is available from IEEE (see [IEEE, 1992]). 

2.11.2 Entities and architectures 

In VHDL, each unit to be modeled is called a design entity or a VHDL entity. 
Design entities are composed of two types of ingredients: an entity declara- 
tion and one (or several) architectures (see fig. 2.48). For each entity, the 
most recently analyzed architecture will be used by default. The use of other
architectures can be specified.




Figure 2.48. An entity consists of an entity declaration and architectures 

We will consider a full adder as an example. Full adders have three input ports 
and two output ports (see fig. 2.49). 



Figure 2.49. Full-adder and its interface signals (inputs: a, b, carry_in;
outputs: sum, carry_out)
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An entity declaration corresponding to fig. 2.49 is the following: 
entity full_adder is                       -- entity declaration
  port (a, b, carry_in : in  Bit;          -- input ports
        sum, carry_out : out Bit);         -- output ports
end full_adder;

Architectures consist of architecture headers and architectural bodies. We can 
distinguish between different styles of bodies, in particular between structural 
and behavioral bodies. We will show how the two are different using the full 
adder as an example. Behavioral bodies include just enough information to 
compute output signals from input signals and the local state (if any), including 
the timing behavior of the outputs. The following is an example of this (<= 
denotes assignments to signals): 

architecture behavior of full_adder is     -- architecture
begin
  sum       <= (a xor b) xor carry_in after 10 ns;
  carry_out <= (a and b) or (a and carry_in) or
               (b and carry_in) after 10 ns;
end behavior;

VHDL-based simulators are capable of displaying output signal waveforms 
resulting from stimuli applied to the inputs of the full adder described above. 

In contrast, structural bodies describe the way entities are composed of simpler 
entities. For example, the full adder can be modeled as an entity consisting of 
three components (see fig. 2.50). These components are called i1 to i3 and are
of type half_adder or or_gate. 




Figure 2.50. Schematic describing structural body of the full adder 

In the 1987 version of VHDL, these components must be declared in a so-called
component declaration. This declaration is very similar to (and serves
the same purpose as) forward declarations in other languages. It
provides the necessary information about the component even if the full description
of that component is not yet stored in the VHDL data base (this may
happen in the case of so-called top-down designs). From the 1992 version of
VHDL onwards, such declarations are not required if the relevant components 
are already stored in the component data base. 

Connections between local components and entity ports are described in port
maps. The following VHDL code represents the structural body shown in fig. 
2.50: 

architecture structure of full_adder is    -- architecture head
  component half_adder
    port (in1, in2 : in Bit; carry : out Bit; sum : out Bit);
  end component;
  component or_gate
    port (in1, in2 : in Bit; o : out Bit);
  end component;
  signal x, y, z : Bit;                    -- local signals
begin                                      -- port map section
  i1: half_adder                           -- introduction of half_adder i1
    port map (a, b, x, y);                 -- connections between ports
  i2: half_adder port map (y, carry_in, z, sum);
  i3: or_gate    port map (x, z, carry_out);
end structure;
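
That the structural body computes the same function as the behavioral body can be checked exhaustively, since there are only eight input combinations. The following Python sketch is an illustration under stated assumptions: it ignores all timing (the 10 ns delays are not modeled), it represents Bit values by the integers 0 and 1, and the function names are chosen for this sketch. The port order of half_adder follows the component declaration (carry first, then sum).

```python
from itertools import product


def behavioral(a, b, carry_in):
    """Boolean equations of the behavioral body, without delays."""
    s = (a ^ b) ^ carry_in
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)
    return s, carry_out


def half_adder(in1, in2):
    """Returns (carry, sum), in the port order of the component declaration."""
    return in1 & in2, in1 ^ in2


def or_gate(in1, in2):
    return in1 | in2


def structural(a, b, carry_in):
    """Mirrors the port maps of instances i1, i2 and i3."""
    x, y = half_adder(a, b)            # i1: port map (a, b, x, y)
    z, s = half_adder(y, carry_in)     # i2: port map (y, carry_in, z, sum)
    carry_out = or_gate(x, z)          # i3: port map (x, z, carry_out)
    return s, carry_out


def equivalent():
    """Compare both descriptions for all eight input combinations."""
    return all(behavioral(*v) == structural(*v) for v in product((0, 1), repeat=3))
```

Such a check only confirms functional equivalence. A VHDL simulator would in addition show the timing behavior of the two architectures.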



2.11.3 Multi-valued logic and IEEE 1164 

In this book, we are restricting ourselves to embedded systems implemented 
with binary logic. Nevertheless, it may be advisable or necessary to use more 
than two values for modeling such systems. For example, our systems might 
contain electrical signals of different strengths and it may be necessary to com- 
pute the strength and the logic level resulting from a connection of two or more 
sources of electrical signals. In the following, we will therefore distinguish be- 
tween the level and the strength of a signal. While the former is an abstraction 
of the signal voltage, the latter is an abstraction of the impedance (resistance) 
of the voltage source. We will be using discrete sets of signal values represent- 
ing the signal level and the strength. Using discrete sets of strengths avoids 
the problems of having to solve Kirchhoff’s equations and enables us to work
with algebraic techniques. We will also model unknown electrical signals by 
special signal values. 
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In practice, electronic design systems use a variety of value sets. Some systems 
allow only two, while others allow 9 or 46. The overall goal of developing 
discrete value sets is to avoid the problems of solving network equations (e.g. 
Kirchhoff’s laws) and still model existing systems with sufficient precision. In
the following, we will present a systematic technique for building up value 
sets and for relating these to each other. We will use the strength of electrical 
signals as the key parameter for distinguishing between various value sets. A 
systematic way of building up value sets, called CSA-theory, was presented by 
Hayes [Hayes, 1982]. We will later show how the standard value set used for 
most cases of VHDL-based modeling can be derived as a special case. 

2.11.3.1 Two logic values (1 signal strength) 

In the simplest case, we will start with just two logic values, called ’0’ and ’1’.
These two values are considered to be of the same strength. This means: if two
wires connect values ’0’ and ’1’, we will not know anything about the resulting
signal level.

A single signal strength may be sufficient if no two wires carrying values ’0’
and ’1’ are connected and no signals of different strength meet at a particular
node of electronic circuits. 

2.11.3.2 Three and four logic values (2 signal strengths) 

In many circuits, there may be instances in which a certain electrical signal is 
not actively driven by any output. This may be the case, when a certain wire is 
not connected to ground, the supply voltage or any circuit node. 

For example, systems may contain open-collector outputs (see fig. 2.51, left)
or tristate outputs (see fig. 2.51, right). Using appropriate input signals, such
outputs can be effectively disconnected from a wire^.

Obviously, the signal strength of disconnected outputs is the smallest strength
that we can think of. In particular, this strength is smaller than that
of ’0’ and ’1’. Furthermore, the signal level of such an output is unknown. This
combination of signal strength and signal value is represented by a logic value
called ’Z’. If a signal of value ’Z’ is connected to another signal, that other signal
will always dominate. For example, if two tristate outputs are connected to the
same bus and if one output contributes a value of ’Z’, the resulting value on the
bus will always be the value contributed by the second output (see fig. 2.52).



^In practice, pull-up transistors may be depletion transistors and the tri-state outputs may be inverting.

Figure 2.51. Outputs that can be effectively disconnected from a wire (left:
open-collector output, input = ’0’ -> A disconnected; right: tristate output,
enable = ’0’ -> A disconnected)




Figure 2.52. Right output dominates bus

In VHDL, each output is associated with a so-called signal driver. Computing 
the value resulting from the contributions of multiple drivers to the same signal
is called resolution and resulting values are computed by functions called
resolution functions.

In most cases, three-valued logic sets {’0’, ’1’, ’Z’} are extended by a fourth value
called ’X’. ’X’ represents an unknown signal level of the same strength as ’0’ or
’1’. More precisely, we are using ’X’ to represent unknown values of signals
that can be either ’0’ or ’1’ or some voltage representing neither ’0’ nor ’1’^.

The resolution that is required if multiple drivers get connected can be computed
very easily, if we make use of a partial order among the four signal values
’0’, ’1’, ’Z’, and ’X’. The partial order is depicted in fig. 2.53.

        ’X’
       /   \
     ’0’   ’1’
       \   /
        ’Z’

Figure 2.53. Partial order for value set {’0’, ’1’, ’Z’, ’X’}

Edges in this figure reflect the domination of signal values. Edges define a
relation >. If a > b, then a dominates b. ’0’ and ’1’ dominate ’Z’. ’X’ dominates
all other signal values. Based on the relation >, we define a relation $\geq$: $a \geq b$
holds iff $a > b$ or $a = b$.

^There are other interpretations of ’X’, but the one presented above is the most useful one in our context.

We define an operation sup on two signals, which returns the supremum of
the two signal values. The supremum c of the two values a and b is the weakest
value for which $c \geq a$ and $c \geq b$ holds. For example, sup(’Z’, ’0’) = ’0’,
sup(’Z’, ’1’) = ’1’ etc. The interesting observation is that resolution functions
should compute the sup function according to the above definition.
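
A resolution function based on sup can be sketched directly from the partial order. The following Python fragment is a minimal illustration and not the resolution function of any VHDL package: the names DOMINATES, below, sup and resolve are chosen for this sketch, and the table lists only the direct domination edges of fig. 2.53.

```python
# Direct domination edges of the partial order: a -> values directly dominated by a.
DOMINATES = {"X": {"0", "1"}, "0": {"Z"}, "1": {"Z"}, "Z": set()}


def below(a):
    """All values b with a >= b (reflexive and transitive)."""
    seen, stack = {a}, [a]
    while stack:
        for b in DOMINATES[stack.pop()]:
            if b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def sup(a, b):
    """Weakest value c with c >= a and c >= b."""
    upper = [c for c in DOMINATES if a in below(c) and b in below(c)]
    for c in upper:
        # The supremum is the upper bound that is dominated by all other upper bounds.
        if all(c in below(d) for d in upper):
            return c
    raise ValueError("no supremum exists")


def resolve(drivers):
    """Resolve the contributions of all drivers of one signal."""
    result = "Z"                  # a signal without drivers is disconnected
    for value in drivers:
        result = sup(result, value)
    return result
```

For example, a bus with one tristate output contributing ’1’ and a second one contributing ’Z’ resolves to ’1’, whereas drivers contributing ’0’ and ’1’ resolve to ’X’.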

2.11.3.3 Seven signal values (3 signal strengths) 

In many circuits, two signal strengths are not sufficient. A common case that 
requires more values is the use of depletion transistors (see fig. 2.54). 




Figure 2.54. Output using depletion transistor

The effect of the depletion transistor is similar to that of a resistor providing a 
low conductance path to the supply voltage VDD. The depletion transistor as 
well as the “pull-down transistor” PD act as drivers for node A of the circuit 
and the signal value at node A can be computed using resolution. The pull- 
down transistor provides a driver value of ’0’ or ’Z’, depending upon the input 
to PD. The depletion transistor provides a signal value, which is weaker than ’0’ 
and ’1’. Its signal level corresponds to the signal level of ’1’. We represent the
value contributed by the depletion transistor by ’H’, and we call it a “weak logic
one”. Similarly, there can be weak logic zeros, represented by ’L’. The value
resulting from the possible connection between ’H’ and ’L’ is called a “weak
logic undefined”, denoted as ’W’. As a result, we have three signal strengths
and seven logic values {’0’, ’1’, ’Z’, ’X’, ’H’, ’L’, ’W’}. Resolution can again be
based on a partial order among these seven values. The corresponding partial
order is shown in fig. 2.55.



        ’X’
       /   \
    ’0’     ’1’        strongest (’X’, ’0’, ’1’)
       \   /
        ’W’
       /   \
    ’L’     ’H’        medium strength (’W’, ’L’, ’H’)
       \   /
        ’Z’            weakest (’Z’)

Figure 2.55. Partial order for value set {’0’, ’1’, ’Z’, ’X’, ’H’, ’L’, ’W’}



This order also defines an operation sup returning the weakest value at least as
strong as the two arguments. For example, sup(’H’, ’0’) = ’0’, sup(’H’, ’Z’) = ’H’,
sup(’H’, ’L’) = ’W’.
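
For value sets organized by signal strength, sup can also be computed without searching the partial order. The following Python sketch assumes an encoding of each value as a pair (strength, level); this encoding and the names ENCODE, DECODE and sup are illustrative assumptions, not taken from any standard.

```python
# strength: 0 = high impedance, 1 = weak, 2 = strong
# level: "0", "1", or None if the level is unknown or not driven
ENCODE = {
    "Z": (0, None),
    "L": (1, "0"), "H": (1, "1"), "W": (1, None),
    "0": (2, "0"), "1": (2, "1"), "X": (2, None),
}
DECODE = {code: name for name, code in ENCODE.items()}

def sup(a, b):
    """Weakest value that is at least as strong as both a and b."""
    strength_a, level_a = ENCODE[a]
    strength_b, level_b = ENCODE[b]
    if strength_a != strength_b:
        # The stronger driver dominates the weaker one.
        return a if strength_a > strength_b else b
    if level_a == level_b:
        return a
    # Same strength, conflicting levels: unknown value of that strength.
    return DECODE[(strength_a, None)]

print(sup("H", "0"), sup("H", "Z"), sup("H", "L"))
```

Under the same assumption, a fourth strength could be added by extending the table with further entries between ’Z’ and the weak values.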

’0’ and ’L’ represent the same signal level, but different strengths. The same
holds for the pair ’1’ and ’H’. Devices increasing the signal strength are called
amplifiers, devices reducing the signal strength are called attenuators.

2.11.3.4 Ten signal values (4 signal strengths) 

In some cases, three signal strengths are not sufficient. For example, there 
are circuits using charges stored on wires. Such wires are charged to levels 
corresponding to ’0’ or ’1’ during some phases of the operation of the electronic 
circuit. This stored charge can control the (high impedance) inputs of some 
transistors. However, if these wires get connected to even the weakest signal
source (except ’Z’), they lose their charge and the signal value from that source
dominates.

For example, in fig. 2.56, we are driving a bus from a specialized output. The
bus has a high capacitive load C. While function f is still ’0’, we set φ to ’1’,
charging capacitor C. Then we set φ to ’0’. If the real value of function f becomes
known and it turns out to be ’1’, we discharge the bus. The key reason for
using pre-charging is that charging a bus using an output like the one shown in
fig. 2.54 is a slow process, since the resistance of depletion transistors is large.
Discharging through regular pull-down transistors PD is a much faster process.

Figure 2.56. Pre-charging a bus



In order to model such cases, we need signal values which are weaker than ’H’
and ’L’, but stronger than ’Z’. We call such values “very weak signal values”
and denote them by ’h’ and ’l’. The corresponding very weak unknown value is
denoted by ’w’. As a result, we obtain ten signal values {’0’, ’1’, ’Z’, ’X’, ’H’, ’L’,
’W’, ’h’, ’l’, ’w’}. Using the signal strength, we can again define a partial order
among these values (see fig. 2.57).

Figure 2.57. Partial order for value set {’0’, ’1’, ’Z’, ’X’, ’H’, ’L’, ’W’, ’h’, ’l’, ’w’}



2.11.3.5 Five signal strengths 

So far, we have ignored power supply signals. These are stronger than the 
strongest signals we have considered so far. Signal value sets taking power 
supply signals into account have resulted in the definition of 46-valued value 
sets [Coelho, 1989]. However, such models are not very popular. 

2.11.3.6 IEEE 1164

In VHDL, there is no predefined number of signal values, except for some
basic support for two-valued logic. Instead, the value sets used can be defined
in VHDL itself and different VHDL models can use different value sets.

However, portability of models would suffer severely if every model made use
of this capability and defined its own value set. In order to simplify exchanging
VHDL models, a standard value set was defined and standardized by the IEEE.
This standard is called IEEE 1164 and is employed in many system models.
IEEE 1164 has nine values: {’0’, ’1’, ’Z’, ’X’, ’H’, ’L’, ’W’, ’U’, ’-’}. The first seven
values correspond to the seven signal values described above. ’U’ denotes an
uninitialized value. It is used by simulators for signals that have not been
explicitly defined.

’-’ denotes the input don’t care. This value needs some explanation. Frequently,
hardware description languages are used for describing Boolean functions.
The VHDL select statement is a very convenient means for doing that.
The select statement corresponds to switch and case statements found in
other languages and its meaning is different from the select statement in ADA.

Example: Suppose that we would like to represent the Boolean function

$f(a,b,c) = a\overline{b} + bc$

Furthermore, suppose that f should be undefined for the case of a = b = c = ’0’.
A very convenient way of specifying this function would be the following:

f <= select a & b & c    -- & denotes concatenation
     '1' when "10-"      -- corresponds to first term
     '1' when "-11"      -- corresponds to second term
     'X' when "000"

This way, functions like the one given above could be easily translated into
VHDL. Unfortunately, the select statement denotes something completely
different. Since IEEE 1164 is just one of a large number of possible value sets,
VHDL does not include any knowledge about the “meaning” of ’-’. Whenever
VHDL tools evaluate select statements like the one above, they check if the
selecting expression (a & b & c in the case above) is equal to the values in the
when clauses. In particular, they check if e.g. a & b & c is equal to "10-". In this
context, ’-’ behaves like any other value: VHDL systems check if c has a value
of ’-’. Since ’-’ is never assigned to any of the variables, these tests will never
be true. Therefore, ’-’ is of limited benefit. The non-availability of convenient
input don’t care values is the price that one has to pay for the flexibility of
defining value sets in VHDL itself.
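
The difference between the two interpretations of ’-’ can be illustrated with a short Python sketch. It is not VHDL: literal_equal mimics the plain equality test described above, while match_dont_care is a hypothetical comparison treating ’-’ as a wildcard. The function names and the default result "0" for all remaining input combinations are assumptions made for this example.

```python
def literal_equal(selector, choice):
    """Plain equality: '-' is compared like any other value."""
    return selector == choice

def match_dont_care(selector, choice):
    """Hypothetical comparison in which '-' in the choice matches any value."""
    return len(selector) == len(choice) and all(
        c == "-" or s == c for s, c in zip(selector, choice)
    )

def f(a, b, c, compare):
    """Evaluate the when clauses of the example in the order given."""
    selector = a + b + c                      # corresponds to a & b & c
    for value, choice in (("1", "10-"), ("1", "-11"), ("X", "000")):
        if compare(selector, choice):
            return value
    return "0"                                # assumed default for other inputs

print(f("1", "0", "1", literal_equal))    # "10-" is never equal to "101"
print(f("1", "0", "1", match_dont_care))  # the first term matches
```

With plain equality, only the clause for "000" can ever be selected, which is exactly the effect described above.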

The nice property of the general discussion on pages 63 to 67 is the following:
it allows us to immediately draw conclusions about the modeling power of 
IEEE 1164. The IEEE standard is based on the 7-valued value set described 
on page 65 and, therefore, is capable of modeling circuits containing depletion 
transistors. It is, however, not capable of modeling charge storage^. 

2.11.4 VHDL processes and simulation semantics 

VHDL treats components described above as processes. The syntax used
above is just a shorthand for processes. The general syntax for processes is
as follows:

label:               -- optional
process
  declarations       -- optional
begin
  statements         -- optional
end process;

Processes may contain wait statements. Such statements can be used to sus- 
pend a process. There are the following kinds of wait statements: 

■ wait on signal list; suspend until one of the signals in the list changes;

■ wait until condition; suspend until condition is met, e.g. a = ’1’;

■ wait for duration; suspend for a specified period of time;

■ wait; suspend indefinitely.

As an alternative to explicit wait statements, a list of signals can be added to 
the process header. In that case, the process is activated whenever one of the 
signals in that list changes its value. Example: The following model of an and- 
gate will execute its body once and will restart from the beginning every time 
one of the inputs changes its value: 

process(x, y) begin
  prod <= x and y;
end process;



^As an exception, if the capability of modeling depletion transistors or pull-up resistors is not needed, one 
could interpret weak values as stored charges. This is, however, not very practical since pull-up resistors 
are found in most actual systems. 
This model is equivalent to

process begin
  prod <= x and y;
  wait on x, y;
end process;

According to the original standards document [IEEE, 1997], the execution of a 
VHDL model is described as follows: The execution of a model consists of an
initialization phase followed by the repetitive execution of process statements 
in the description of that model. Each such repetition is said to be a simulation 
cycle. In each cycle, the values of all signals in the description are computed. 
If as a result of this computation an event occurs on a given signal, process 
statements that are sensitive to that signal will resume and will be executed as 
part of the simulation cycle. 

The initialization phase takes signal initializations into account and executes 
each process once. It is described in the standards as follows*: 

At the beginning of initialization, the current time, Tc, is assumed to be 0 ns.
The initialization phase consists of the following steps:

■ The driving value and the effective value of each explicitly declared signal 
are computed, and the current value of the signal is set to the effective 
value. This value is assumed to have been the value of the signal for an 
infinite length of time prior to the start of the simulation. ... 

■ Each ... process in the model is executed until it suspends. ... 

■ The time of the next simulation cycle (which in this case is the first simulation
cycle), Tn, is calculated according to the rules of step f of the simulation
cycle, below (item e in the abridged list given here).

Each simulation cycle starts with setting the current time to the next time at
which changes must be considered. This time Tn was either computed during
the initialization or during the last execution of the simulation cycle. Simulation
terminates when the current time reaches its maximum, TIME'HIGH.
According to the original document, the simulation cycle is described as follows:
A simulation cycle consists of the following steps:



*We leave out the discussion of implicitly declared signals and so-called postponed processes introduced in 
the 1997 version of VHDL. 

^In order not to get lost in the amount of details provided by the standard, some of its sections (indicated by
“...”) are omitted in the citation.

a) The current time, Tc, is set equal to Tn. Simulation is complete when Tn =
TIME'HIGH and there are no active drivers or process resumptions at Tn.

b) Each active explicit signal in the model is updated. (Events may occur as a 
result.) ... 

This phrase from the document refers to the fact that in the cycle preceding
the current cycle, new future values for some of the signals have been computed.
If Tc corresponds to the time at which these values become valid,
they are now assigned. Note that new values of signals are never immediately
assigned while executing a simulation cycle. They are not assigned
before the next simulation cycle, at the earliest. Signals that change their
value generate so-called events which, in turn, may enable the execution of
processes that are sensitive to that signal.

c) For each process P, if P is currently sensitive to a signal S and if an event
has occurred on S in this simulation cycle, then P resumes.

d) Each ... process that has resumed in the current simulation cycle is executed
until it suspends.

e) The time of the next simulation cycle, Tn is determined by setting it to the 
earliest of 

1 TIME'HIGH (This is the end of simulation time).

2 The next time at which a driver becomes active (this is the next instance 
in time, at which a driver specifies a new value), or 

3 The next time at which a process resumes (this time is determined by
wait for statements).

If Tn = Tc, then the next simulation cycle (if any) will be a delta cycle.

The iterative nature of simulation cycles is shown in fig. 2.58. 

[Diagram labels: Start of simulation; Future values for signal drivers; Assign
new values to signals; Evaluate processes; Activate all processes sensitive to
signal changes]

Figure 2.58. VHDL simulation cycles
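
The steps of the simulation cycle can be sketched as a small event-driven kernel. The following Python code is a strongly simplified illustration and not the algorithm of the standard: there is no TIME'HIGH, no resolution and no wait statement; processes are plain functions with a static sensitivity list. The class Simulator and all of its method names are assumptions made for this sketch. It is applied to the and-gate model shown earlier.

```python
import heapq

class Simulator:
    """Minimal event-driven kernel with delta cycles (illustrative only)."""

    def __init__(self, initial_values):
        self.values = dict(initial_values)  # current value of each signal
        self.pending = []                   # heap of (time, seq, signal, value)
        self.processes = []                 # list of (sensitivity set, body)
        self.now = 0                        # current time Tc
        self.seq = 0                        # keeps transactions in issue order
        self.log = []                       # events: (time, delta, signal, value)

    def add_process(self, sensitivity, body):
        self.processes.append((set(sensitivity), body))

    def schedule(self, signal, value, delay=0):
        """Signal assignment: the new value is never assigned immediately."""
        heapq.heappush(self.pending, (self.now + delay, self.seq, signal, value))
        self.seq += 1

    def run(self, max_cycles=1000):
        for _, body in self.processes:      # initialization: run each process once
            body(self)
        delta = 0
        for _ in range(max_cycles):
            if not self.pending:            # no driver becomes active any more
                return
            t_next = self.pending[0][0]     # Tn: time of the earliest transaction
            delta = delta + 1 if t_next == self.now else 0
            self.now = t_next               # step a
            updates = {}
            while self.pending and self.pending[0][0] == self.now:
                _, _, signal, value = heapq.heappop(self.pending)
                updates[signal] = value     # the last transaction wins
            events = set()
            for signal, value in updates.items():     # step b
                if self.values[signal] != value:
                    self.values[signal] = value
                    events.add(signal)
                    self.log.append((self.now, delta, signal, value))
            for sensitivity, body in self.processes:  # steps c and d
                if sensitivity & events:
                    body(self)
        raise RuntimeError("cycle limit reached, possibly an endless delta loop")

def and_gate(sim):
    sim.schedule("prod", sim.values["x"] & sim.values["y"])

def stimulus(sim):
    sim.schedule("x", 1, delay=10)
    sim.schedule("y", 1, delay=20)

sim = Simulator({"x": 0, "y": 0, "prod": 0})
sim.add_process({"x", "y"}, and_gate)
sim.add_process(set(), stimulus)   # empty sensitivity: runs once, like "wait;"
sim.run()
print(sim.log)
```

In the recorded events, prod changes at time 20, one delta cycle after the change of y that caused it.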

Delta (δ) simulation cycles have been the source of many discussions. Their
purpose is to introduce an infinitesimally small delay even in cases in which the
user did not specify any. We will show the effect of these cycles
using a flip-flop as an example. Fig. 2.59 shows the schematic of the flip-flop.




Figure 2.59. RS-Flipflop 

The flip-flop is modeled in VHDL as follows:

entity RS_Flipflop is
  port (R:  in    BIT;   -- reset
        S:  in    BIT;   -- set
        Q:  inout BIT;   -- output
        nQ: inout BIT    -- Q-bar
       );
end RS_Flipflop;

architecture one of RS_Flipflop is
begin
  process (R, S, Q, nQ)
  begin
    Q  <= R nor nQ;
    nQ <= S nor Q;
  end process;
end one;

Ports Q and nQ must be of mode inout since they are also read internally, which 
would not be possible if they were of mode out. Fig. 2.60 shows the simulation 
times at which signals are updated for this model. 

Figure 2.60. δ cycles for RS-flip-flop

Simulation terminates after two δ cycles. δ cycles correspond to an infinitesimally
small unit of time, which will always exist in reality. δ cycles ensure that
simulation respects causality and that the results do not depend on the order in
which parts of the model are executed by the simulation. Otherwise, simulation
would become non-deterministic, which is not what we expect from the
simulation of a real circuit with deterministic behavior. There can be arbitrarily
many δ cycles before the current time Tc is advanced. This possibility of
infinite loops can be confusing. One of the options of avoiding this possibility
would be to disallow zero delays, which we used in our model of the flip-flop.
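
The behavior of the flip-flop model during δ cycles can be reproduced with a few lines of Python. The sketch below is an illustration only: the function names nor and settle and the limit max_delta are assumptions, and the stimulus used is an assumed example which is not necessarily the one of fig. 2.60. In each δ cycle, both new values are computed from the current values before any of them is assigned.

```python
def nor(a, b):
    return int(not (a or b))

def settle(r, s, q, nq, max_delta=10):
    """Apply Q <= R nor nQ; nQ <= S nor Q in delta cycles until stable.

    Returns the sequence of (Q, nQ) pairs, starting with the initial pair.
    """
    trace = [(q, nq)]
    for _ in range(max_delta):
        new_q, new_nq = nor(r, nq), nor(s, q)   # computed from current values
        if (new_q, new_nq) == (q, nq):          # no event: simulation stops
            return trace
        q, nq = new_q, new_nq                   # assigned in the next delta cycle
        trace.append((q, nq))
    raise RuntimeError("no stable state within max_delta delta cycles")

# Flip-flop in the reset state (Q = 0, nQ = 1); the set input S becomes 1.
print(settle(r=0, s=1, q=0, nq=1))
```

For this stimulus, the outputs change in two successive δ cycles. Calling settle(0, 0, 0, 0) never reaches a stable state and runs into the limit, which illustrates the possibility of endless δ cycles mentioned above.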

A very important concept of VHDL is the separation between the computation 
of new values for signals and their actual assignment. This separation enables 
deterministic simulation results. In a model containing the lines 

a <= b; 
b <= a; 

signals a and b will always be swapped. If the assignments were performed 
immediately, the result would depend on the order in which we execute the 
assignments (see also page 25). 
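
The effect of separating computation and assignment can be demonstrated with a small Python sketch. It is an illustration only; the function names run_deferred and run_immediate are assumptions. Each assignment is given as a pair (target, source).

```python
def run_deferred(signals, assignments):
    """Compute all new values first, then assign them (signal semantics)."""
    new_values = {target: signals[source] for target, source in assignments}
    return {**signals, **new_values}

def run_immediate(signals, assignments):
    """Assign each value as soon as it has been computed."""
    result = dict(signals)
    for target, source in assignments:
        result[target] = result[source]
    return result

start = {"a": 0, "b": 1}
swap = [("a", "b"), ("b", "a")]          # a <= b; b <= a;

print(run_deferred(start, swap), run_deferred(start, swap[::-1]))
print(run_immediate(start, swap), run_immediate(start, swap[::-1]))
```

With deferred assignment, both execution orders swap the two values. With immediate assignment, the result depends on the order and no swap takes place in either case.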

2.12 SystemC 

Due to the trend of implementing more and more functionality in software,
a growing number of embedded systems includes a mixture of hardware and
software. Most of the embedded system software is specified in C. For example,
embedded systems implement standards such as MPEG 1/2/4 or decoders
for mobile phone standards such as GSM. The standards are frequently available
in the form of “reference implementations”, consisting of C programs not
optimized for speed but providing the required functionality. The disadvantage
of design methodologies based on VHDL or Verilog is the fact that these
reference implementations have to be rewritten in order to generate hardware.
Furthermore, simulating hardware and software together requires interfacing
software and hardware simulators. Typically, this involves a loss of simulation
efficiency and inconsistent user interfaces. Also, designers have to learn several
languages.

Therefore, there has been a search for techniques for representing hardware
structures in software languages. Some fundamental problems have to be
solved before hardware can be modeled with software languages:

■ Concurrency, as it is found in hardware, has to be modeled in software.

■ There has to be a representation for simulation time.

■ Multiple-valued logic and resolution as described earlier must be supported.

■ The deterministic behavior of almost all useful hardware circuits must be
guaranteed.



SystemC [SystemC, 2002] is a C++ class library designed to solve these
problems. With SystemC, specifications can be written in C or C++, making
appropriate references to the class libraries.

SystemC comprises a notion of processes executed concurrently. Simulation
semantics are similar to VHDL, including the presence of delta cycles. The
execution of these processes is controlled via sensitivity lists and calls to wait
primitives. The sensitivity list concept of VHDL has been extended to also
include dynamic sensitivity lists (in SystemC 2.0).

SystemC includes a model of time. SystemC 1.0 uses floating point numbers to 
denote time. In SystemC 2.0, an integer model of time is preferred. SystemC 
2.0 also supports physical units such as picoseconds, nanoseconds, microsec- 
onds etc. 

SystemC data types include all common hardware types: four-valued logic 
(’0’, ’1’, ’X’ and ’Z’) and bitvectors of different lengths are supported. Writing
digital signal processing applications is simplified due to the availability of 
fixed-point data types. 

Deterministic behavior (see page 27) is not guaranteed in general, unless a 
certain modeling style is used. Using a command line option, the simulator 
can be directed to run processes in different orders. This way, the user can 
check if the simulation results depend on the sequence in which the processes 
are executed. However, for models of realistic complexity, only the presence 
of non-deterministic behavior can be proved, not its absence. 
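
The idea of running processes in different orders can be sketched in Python. The code below does not use SystemC and does not reproduce any option of its simulator; the function check_order_dependence and the example processes are assumptions made for illustration. Processes are modeled as functions modifying a shared state.

```python
from itertools import permutations

def check_order_dependence(processes, initial_state):
    """Return True if different execution orders produce different results."""
    results = set()
    for order in permutations(processes):
        state = dict(initial_state)        # fresh copy for every order
        for process in order:
            process(state)
        results.add(tuple(sorted(state.items())))
    return len(results) > 1

def increment(state):
    state["x"] = state["x"] + 1

def double(state):
    state["x"] = state["x"] * 2

def set_y(state):
    state["y"] = 5

print(check_order_dependence([increment, double], {"x": 1, "y": 0}))
print(check_order_dependence([increment, set_y], {"x": 1, "y": 0}))
```

A result of False only means that no order dependence was observed for the given initial state. As stated above, such experiments can show the presence of non-deterministic behavior, but not its absence.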

Reusing hardware components in different contexts is simplified by the sep- 
aration of computation and communication. SystemC 2.0 provides channels, 
ports and interfaces as abstract components for communication. 

SystemC has the potential for replacing existing VHDL-based design flows,
even though hardware synthesis from SystemC only starts becoming available
at the time of the writing of this book [Herrera et al., 2003a], [Herrera et al.,
2003b]. Methodology and applications for SystemC-based design are described
in a book on that topic [Muller et al., 2003].

2.13 Verilog and SystemVerilog

Verilog [Thomas and Moorby, 1991] is another hardware description language.
Initially it was a proprietary language, but it was later standardized as IEEE
standard 1364, with versions called IEEE standard 1364-1995 (Verilog version
1.0) and IEEE standard 1364-2001 (Verilog 2.0). Some features of Verilog are
quite similar to VHDL. Just like in VHDL, designs are described as a set of
connected design entities, and design entities can be described behaviorally.
Also, processes are used to model concurrency of hardware components. Just
like in VHDL, bitvectors and time units are supported. There are, however,
some areas in which Verilog is less flexible and focuses more on comfortable
built-in features. For example, standard Verilog does not include the flexible
mechanisms for defining enumerated types like the ones defined in the IEEE
1164 standard. However, Verilog support for four-valued logic is built into the
language, and the standard IEEE 1364 also provides multiple-valued logic with
8 different signal strengths. Multiple-valued logic is more tightly integrated
into Verilog than into VHDL. The Verilog logic system also provides more
features for transistor-level descriptions. However, VHDL is more flexible.
For example, VHDL allows hardware entities to be instantiated in loops. This
can be used to generate a structural description for, e.g. n-bit adders without
having to specify n adders and their interconnections manually.
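
The idea of instantiating components in a loop can be illustrated independently of any hardware description language. The following Python sketch generates the structural description of an n-bit ripple-carry adder as a list of full-adder instances; the function name, the instance names and the naming scheme for nets are assumptions made for this example.

```python
def ripple_carry_netlist(n):
    """Return one full-adder instance per bit, connected by carry nets."""
    instances = []
    for i in range(n):                       # one instance per loop iteration
        instances.append({
            "name": f"fa{i}",
            "a": f"a[{i}]",
            "b": f"b[{i}]",
            "cin": "cin" if i == 0 else f"c[{i}]",
            "sum": f"s[{i}]",
            "cout": "cout" if i == n - 1 else f"c[{i + 1}]",
        })
    return instances

for instance in ripple_carry_netlist(4):
    print(instance)
```

The carry output of each instance is connected to the carry input of the next one; only the value of n has to be changed in order to obtain adders of a different width.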

Verilog has a similar number of users as VHDL. While VHDL is more popular
in Europe, Verilog is more popular in the US.

Verilog versions 3.0 and 3.1 are also known as SystemVerilog. They include 
numerous extensions to Verilog 2.0. These extensions include [Accellera, 
2005]: 

■ additional language elements for modeling behavior, 

■ C data types such as int and type definition facilities such as typedef and 
struct, 

■ definition of interfaces of hardware components as separate entities, 

■ standardized mechanism for calling C/C++ functions and, to some extent,
for calling built-in Verilog functions from C,

■ significantly enhanced features for describing an environment (called test- 
bench) for the hardware under design (called CUD), and for using the test- 
bench to verify the CUD by simulation, 

■ classes known from object-oriented programming for use within testbenches,

■ dynamic process creation,

■ standardized interprocess communication and synchronization, including 
semaphores, 

■ automatic memory allocation and deallocation, 

■ language features that provide a standardized interface to formal verifica- 
tion (see page 199). 

Due to the capability of interfacing with C and C++, interfacing to SystemC
models is also possible. Improved facilities for simulation- as well as for formal
verification-based design validation and the possible interfacing to SystemC
will potentially create a very good acceptance of SystemVerilog.

2.14 SpecC 

The SpecC language [Gajski et al., 2000] is based on the clear separation between
communication and computation that should be used for modeling embedded
systems. This separation paves the way for re-using components in
different contexts and enables plug-and-play for system components. SpecC
models systems as hierarchical networks of behaviors communicating through
channels. SpecC descriptions consist of behaviors, channels and interfaces.
Behaviors include ports, locally instantiated components, private variables and
functions and a public main function. Channels encapsulate communication.
They include variables and functions, which are used for the definition of a
communication protocol. Interfaces link behaviors and channels together.
They declare the communication protocols which are defined in a channel.

SpecC can model hierarchies with nested behaviors. Fig. 2.61 [Gajski et al.,
2000] shows a component B including sub-components b1 and b2.




Figure 2.61. Structural hierarchy of SpecC example 

The sub-components are communicating through integer c1 and through channel
c2. The structural hierarchy includes b1 and b2 as the leaves. b1 and b2 are
executed concurrently, denoted by the keyword par in SpecC. This structural
hierarchy is described in the following SpecC model.

interface L { void Write(int x); };
interface R { int Read(void); };
channel C implements L, R
{ int Data; bool Valid;
  void Write(int x) { Data = x; Valid = true; }
  int Read(void)
  { while (!Valid) waitfor(10); return (Data); } };
behavior B1(in int p1, L p2, in int p3)
{ void main(void) { /* ... */ p2.Write(p1); } };
behavior B2(out int p1, R p2, out int p3)
{ void main(void) { /* ... */ p3 = p2.Read(); } };
behavior B(in int p1, out int p2)
{ int c1; C c2; B1 b1(p1, c2, c1); B2 b2(c1, c2, p2);
  void main(void)
  { par { b1.main(); b2.main(); } }
};

Note that the interface protocol implemented in channel C, consisting of meth- 
ods for read and write operations, can be changed without changing behaviors 
B1 and B2. For example, communication can be bit-serial or parallel and the 
choice does not affect the models of B1 and B2. This is a necessary feature for 
IP-reuse. 
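
The separation of computation and communication can be sketched in Python. The code below is not SpecC and ignores concurrency: the two behaviors are simply executed one after the other. The class names ParallelChannel and BitSerialChannel, the width of 8 bits and the function names are assumptions made for this illustration.

```python
class ParallelChannel:
    """Transfers the complete value in one step."""
    def __init__(self):
        self.data = None

    def write(self, x):
        self.data = x

    def read(self):
        return self.data


class BitSerialChannel:
    """Transfers the value as a sequence of bits, least significant bit first."""
    WIDTH = 8   # assumed word length

    def __init__(self):
        self.bits = []

    def write(self, x):
        self.bits = [(x >> i) & 1 for i in range(self.WIDTH)]

    def read(self):
        return sum(bit << i for i, bit in enumerate(self.bits))


def b1(p1, channel):
    """Producer: only uses the write method of the channel."""
    channel.write(p1)

def b2(channel):
    """Consumer: only uses the read method of the channel."""
    return channel.read()

def b(p1, channel):
    b1(p1, channel)
    return b2(channel)

print(b(42, ParallelChannel()), b(42, BitSerialChannel()))
```

Both channels deliver the same value, and b1 and b2 remain unchanged when one channel is replaced by the other.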

In order to simplify designs containing software and hardware components, the
syntax of SpecC is based on C and C++. In fact, SpecC models are translated
into C++ for simulation.

The communication model of SpecC has inspired communication in SystemC 2.0.

2.15 Additional languages 

A large number of languages has been designed for embedded applications. 
The following is a list of some of them: 

■ Pearl: Pearl [Deutsches Institut für Normung, 1997] was designed for industrial
control applications. It includes a large repertoire of language
elements for controlling processes and referring to time. It requires an
underlying real-time operating system. Pearl has been very popular in Europe
and a large number of industrial control projects has been implemented in
Pearl.

■ Chill: Chill [Winkler, 2002] was designed for telephone exchange stations. 
It was standardized by the CCITT and used in telecommunication equip- 
ment. Chill is a kind of extended PASCAL. 

■ IEC 60848, STEP 7: IEC 60848 [IEC, 2002] and STEP 7 [Berger, 2001]
are two languages that are used in control applications. Both provide graphical
elements for describing the system functionality.

■ SpecCharts: SpecCharts [Gajski et al., 1994], as a predecessor to SpecC,
combines the advantages of StateCharts and VHDL. It is based on the StateChart
modeling paradigm of automata, but allows their behavior to be described
in VHDL. In addition, a distinction between transitions to be taken
immediately and transitions to be taken after the completion of all computations
is made. The first type of transitions simplifies the modeling
of exceptions. Just like StateCharts, SpecCharts has problems describing
structural hierarchies.

■ Estelle: This language was designed to describe communication protocols.
Similar to SDL, Estelle assumes communication via channels and FIFO
buffers. Attempts to unify Estelle and SDL failed.

■ LOTOS, Z: These languages [Jeffrey and Leduc, 1996], [Spivey, 1992] 
are algebraic specification languages enabling precise specifications and 
formal proofs. Unfortunately, they are not executable and hence cannot be 
used for early design validations. 

■ Silage: This language is tailored towards digital signal processing. It is 
a functional language, paving the way for getting rid of sequential, von- 
Neumann style of programming. Unfortunately, it has not been accepted 
by designers. 

■ Rosetta: The development of the Rosetta language is part of the activities
of the Accellera initiative. Accellera’s mission is to drive worldwide development
and use of standards required by systems, semiconductor and design
tools companies, which enhance a language-based design automation
process [Accellera, 2002]. The Accellera initiative also works on updating
VHDL and Verilog standards.

■ Esterel: The definition of Esterel comprises the following salient features
[Boussinot and de Simone, 1991]: Esterel is a reactive language: when activated
with an input event, Esterel models react by producing an output
event. Esterel is a synchronous language: all reactions are assumed to be
completed in zero time and it is sufficient to analyze the behavior at discrete
moments in time. This idealized model avoids all discussions about overlapping
time ranges and about events that arrive while the previous reaction
has not been completed. Like other concurrent languages, Esterel has a
parallelism operator, written ||. Like in StateCharts, communication is based
on a broadcast mechanism. In contrast to StateCharts, however, communication
is instantaneous. This means that all signals generated at a particular
moment in time are also seen by the other parts of the model at the same
moment in time and these other parts, if sensitive to the generated signals,
react at the same moment in time. Several rounds of evaluations may be
required until a stable state (if any) is reached. This propagation of values
during the same macroscopic instant of time corresponds to the δ cycles of
VHDL and the generation of a next status for the same moment in time in
StateCharts, except that the broadcast is now instantaneous. Instantaneous
broadcast can lead to instantaneous loops. These loops are detected by the
Esterel compiler, but not all infinite loops can be detected by the compiler.
For more and updated information about Esterel, refer to a web site [Esterel,
2002].

■ MATLAB/Simulink: MATLAB/Simulink [Tewari, 2001] is a modeling
and simulation tool based on mathematics. Actual systems can be described,
for example, in the form of partial differential equations. This approach
is appropriate for modeling physical systems such as cars or trains
and then simulating the behavior of these systems. Also, digital signal
processing systems can be conveniently modeled with MATLAB. In order
to generate implementations, MATLAB/Simulink models first have to be
translated into a language supported by software or hardware design systems,
such as C or VHDL.

2.16 Levels of hardware modeling 

In practice, designers start design cycles at various levels of abstraction. In
some cases, these are high levels describing the overall behavior of the system
to be designed. In other cases, the design process starts with the specification
of electrical circuits at lower levels of abstraction. For each of the levels, a
variety of languages exists, and some languages cover several levels. In the
following, we will describe a set of possible levels. Some lower end levels
are presented here for context reasons. Specifications should not start at those
levels. The following is a list of frequently used names and attributes of levels:

■ System level models: The term system level is not clearly defined. It is
used here to denote the entire embedded system and the system into which
information processing is embedded (“the product”), and possibly also the
environment (the physical input to the system, reflecting e.g. the roads,
weather conditions etc.). Obviously, such models include mechanical as
well as information processing aspects and it may be difficult to find appropriate
simulators. Possible solutions include VHDL-AMS (the analog
extension to VHDL), SystemC or MATLAB. MATLAB and VHDL-AMS
support modeling partial differential equations, which is a key requirement
for modeling mechanical systems. It is a challenge to model information
processing parts of the system in such a way that the simulation model
can also be used for the synthesis of the embedded system. If this is not
possible, error-prone manual translations between different models may be
needed.

■ Algorithmic level: At this level, we are simulating the algorithms that we 
intend to use within the embedded system. For example, we might be sim- 
ulating MPEG video encoding algorithms in order to evaluate the resulting 
video quality. For such simulations, no reference is made to processors or 
instruction sets. 

Data types may still allow a higher precision than the final implementation. 
For example, MPEG standards use double precision floating point numbers. 
The final embedded system will hardly include such data types. If data 
types have been selected such that every bit corresponds to exactly one bit 
in the final implementation, the model is said to be bit-true. Translating 
non-bit-true into bit-true models should be done with tool support (see page 158).

Models at this level may consist of single processes or of sets of cooperating 
processes. 

■ Instruction set level: In this case, algorithms have already been compiled 
for the instruction set of the processor(s) to be used. Simulations at this 
level allow counting the executed number of instructions. There are several 
variations of the instruction set level: 

- In a coarse-grained model, only the effect of the instructions is sim- 
ulated and their timing is not considered. The information available 
in assembly reference manuals (instruction set architecture (ISA)) is 
sufficient for defining such models. 

- Transaction level modeling: In transaction level modeling, transactions,
such as bus reads and writes, and communication between different
components are modeled. Transaction level modeling includes fewer
details than cycle-true modeling (see below), enabling significantly
superior simulation speeds [Clouard et al., 2003].

- In a more fine-grained model, we might have cycle-true instruction 
set simulation. In this case, the exact number of clock cycles required 
to run an application can be computed. Defining cycle-true models re- 
quires a detailed knowledge about processor hardware in order to cor- 
rectly model, for example, pipeline stalls, resource hazards and mem- 
ory wait cycles. 

■ Register-transfer level (RTL): At this level, we model all the components 
at the register-transfer level, including arithmetic/logic units (ALUs), regis- 
ters, memories, muxes and decoders. Models at this level are always cycle- 
true. Automatic synthesis from such models is not a major challenge. 

■ Gate-level models: In this case, models contain gates as the basic compo- 
nents. Gate-level models provide accurate information about signal tran- 
sition probabilities and can therefore also be used for power estimations. 
Also delay calculations can be more precise than for the RTL. However, 
typically no information about the length of wires and hence no informa- 
tion about capacitances is available. Hence, delay and power consumption 
calculations are still estimates. 

The term “gate-level model” is sometimes also employed in situations in 
which gates are only used to denote Boolean functions. Gates in such a 
model do not necessarily represent physical gates; we are only considering 
the behavior of the gates, not the fact that they also represent physical com- 
ponents. More precisely, such models should be called “Boolean function 
models” but this term is not frequently used. 

■ Switch-level models: Switch-level models use switches (transistors) as
their basic components. Switch-level models use digital value models
(refer to page 62 for a description of possible value sets). In contrast to
gate-level models, switch-level models are capable of reflecting
bidirectional transfer of information.

■ Circuit-level models: Circuit theory and its components (current and volt- 
age sources, resistors, capacitances, inductances and possibly macro-models 
of semiconductors) form the basis of simulations at this level. Simulations 
involve partial differential equations. These equations are linear if and only 
if the behavior of semiconductors is linearized (approximated). The most 
frequently used simulator at this level is SPICE [Vladimirescu, 1987] and 
its variants. 



*These models could be represented with binary decision diagrams (BDDs) [Wegener, 2000].

■ Layout models: Layout models reflect the actual circuit layout. Such
models include geometric information. Layout models cannot be simulated
directly, since the geometric information does not directly provide infor- 
mation about the behavior. Behavior can be deduced by correlating the 
layout model with a behavioral description at a higher level or by extract- 
ing circuits from the layout, using knowledge about the representation of 
circuit components at the layout level. In a typical design flow, the length of 
wires and the corresponding capacitances are extracted from the layout and 
back-annotated to descriptions at higher levels. This way, more precision 
can be gained for delay and power estimations. 

■ Process and device models: At even lower levels, we can model
fabrication processes. Using information from such models, we can compute
parameters (gains, capacitances etc.) for devices (transistors).

2.17 Language comparison 

None of the languages presented so far meets all the requirements for
specification languages for embedded systems. Fig. 2.62 presents an overview
of some of the key properties of some of the languages. Exceptions and dynamic
process creation are supposed to be supported in SystemC 3.0.

Figure 2.62. Language comparison



It is not very likely that a single language will ever meet all requirements, since 
some of the requirements are essentially conflicting. A language supporting 
hard real-time requirements well may be inconvenient to use for less strict real- 
time requirements. A language appropriate for distributed control-dominated 
applications may be poor for local data-flow dominated applications. Hence, 
we can expect that we will have to live with compromises.

Which compromises are actually used in practice? Assembly language
programming was very common in the early years of embedded systems
programming. Programs were small enough for their complexity to be handled
in assembly languages. The next step is the use of C or derivatives of C.
Due to the ever increasing complexity of embedded system software (see page
9), higher level languages are to follow the introduction of C. Object oriented
languages and SDL are languages which provide the next level of abstraction.
Also, languages like UML are required to capture specifications at an early
design stage. In practice, these languages can be used as shown in fig. 2.63.




Figure 2.63. Using various languages in combination

According to fig. 2.63, languages like SDL or StateCharts can be translated
into C. These C descriptions are then compiled. Starting with SDL or
StateCharts also opens the way to implementing the functionality in hardware,
if translators from these languages to VHDL are provided. Both C and VHDL
will certainly survive as intermediate languages for many years. Java does not
need intermediate steps but does also benefit from good translation concepts to
assembly languages.

2.18 Dependability requirements 

The previous sections have mostly focused on the specification of the
functional behavior of the system to be designed. However, there may be
safety-critical systems, for which safety requirements are actually the
dominating requirement. Safety requirements cannot come in as an afterthought,
but have to be considered right from the beginning. The design of safe and
dependable systems is a topic of its own. This book can only provide a few
hints in this direction.

According to Kopetz [Kopetz, 2003], the following must be taken into account:
For safety-critical systems, the system as a whole must be more dependable
than any of its parts. Allowed failure rates may be in the order of 1 failure
per $10^9$ hours. This may be in the order of 1000 times less than typical
failure rates of chips. Obviously, fault-tolerance mechanisms must be used.
Due to the low acceptable failure rate, systems are not 100% testable. Instead,
safety must be shown by a combination of testing and reasoning. Abstraction
must be used to make the system explainable using a hierarchical set of
behavioral models. Design faults and human failures must be taken into account.
In order to address these challenges, Kopetz proposed the following twelve
design principles:

1 Safety considerations may have to be used as the important part of the 
specification, driving the entire design process. 

2 Precise specifications of design hypotheses must be made right at the be- 
ginning. These include expected failures and their probability. 

3 Fault containment regions (FCRs) must be considered. Faults in one FCR 
should not affect other FCRs. 

4 A consistent notion of time and state must be established. Otherwise, it will 
be impossible to differentiate between original and follow-up errors. 

5 Well-defined interfaces have to hide the internals of components.

6 It must be ensured that components fail independently.

7 Components should consider themselves to be correct unless two or more
other components claim the contrary to be true (principle of
self-confidence).

8 Fault tolerance mechanisms must be designed such that they do not create
any additional difficulty in explaining the behavior of the system. Fault
tolerance mechanisms should be decoupled from the regular function.

9 The system must be designed for diagnosis. For example, it has to be
possible to identify existing (but masked) errors.

10 The man-machine interface must be intuitive and forgiving. Safety should
be maintained despite mistakes made by humans.

11 Every anomaly should be recorded. These anomalies may be unobservable
at the regular interface level. This recording should involve internal effects,
since otherwise they may be masked by fault-tolerance mechanisms.

12 Provide a never-give-up strategy. Embedded systems may have to provide
uninterrupted service. The generation of pop-up windows or going offline
is unacceptable.

For further information about dependability and safety issues, consult the
books [Laprie, 1992], [Neumann, 1995], [Leveson, 1995], [Storey, 1996],
[Geffroy and Motet, 2002] on those areas.




Chapter 3 



EMBEDDED SYSTEM HARDWARE 



3.1 Introduction 

It is one of the characteristics of embedded systems that both hardware and
software must be taken into account. The reuse of available hard- and software
components is at the heart of the proposed platform-based design
methodology. The methodology will be described starting at page 151. Consistent
with the need to consider available hardware components and with the design
information flow shown in fig. 3.1, we are now going to describe some of the
essentials of embedded system hardware.




Figure 3.1. Simplified design information flow 

Hardware for embedded systems is much less standardized than hardware for
personal computers. Due to the huge variety of embedded system hardware, it
is impossible to provide a comprehensive overview of all types of hardware
components. Nevertheless, we will try to provide an overview of some of
the essential components which can be found in most systems.

In many of the embedded systems, especially in control systems, embedded 
hardware is used in a loop (see fig. 3.2). 




Figure 3.2. Hardware in the loop 

In this loop, information about the physical environment is made available 
through sensors. Typically, sensors generate continuous sequences of analog 
values. In this book, we will restrict ourselves to information processing in 
digital computers processing discrete sequences of values. Appropriate con- 
versions are performed by two kinds of circuits: sample-and-hold-circuits and 
analog-to-digital (A/D) converters. After this conversion, information can be 
processed digitally. Generated results can be displayed and also be used to 
control the physical environment through actuators. Since most actuators are 
analog actuators, conversion from digital to analog signals is also needed. 

This model is obviously appropriate for control applications. For other appli- 
cations, it can be employed as a first order approximation. For example, in 
mobile phones, sensors correspond to the antenna and actuators correspond to 
the speakers. In the following, we will describe essential hardware components 
of embedded systems following the loop structure of fig. 3.2. 

3.2 Input 
3.2.1 Sensors 

We start with a brief discussion of sensors. Sensors can be designed for
virtually every physical quantity. There are sensors for weight, velocity,
acceleration, electrical current, voltage, temperatures etc. A large number of
physical effects can be used for constructing sensors [Elsevier B.V., 2003a].
Examples include the law of induction (generation of voltages in a changing
magnetic field), or light-electric effects. Also, there are sensors for
chemical substances [Elsevier B.V., 2003b].

In recent years, a huge number of sensors has been designed and much of
the progress in designing smart systems can be attributed to modern sensor
technology. Hence, it is impossible to cover this subset of embedded hardware
technology comprehensively and we can only give characteristic examples:

■ Acceleration sensors: Fig. 3.3 shows a small sensor manufactured using 
microsystem technology. The sensor contains a small mass in its center. 
When accelerated, the mass will be displaced from its standard position, 
thereby changing the resistance of the tiny wires connected to the mass. 




Figure 3.3. Acceleration sensor (courtesy S. Büttgenbach, IMT, TU Braunschweig), (c) TU
Braunschweig, Germany



■ Rain sensors: In order to remove distraction from drivers, some recent 
high end cars contain rain sensors. Using these, the speed of the wipers can 
be automatically adjusted to the amount of rain. 

■ Image sensors: There are essentially two kinds of image sensors:
charge-coupled devices (CCDs) and CMOS sensors. In both cases, arrays of light
sensors are used. The architecture of CMOS sensor arrays is similar to that
of standard memories: individual pixels can be randomly addressed and
read out. CMOS sensors use standard CMOS technology for integrated
circuits [Dierickx, 2000]. Due to this, sensors and logic circuits can be
integrated on the same chip. This allows some preprocessing to be done
already on the sensor chip, leading to so-called smart sensors. CMOS
sensors require only a single standard supply voltage and interfacing in
general is easy. Therefore, CMOS-based sensors can be cheap. In contrast,
CCD technology is optimized for optical applications. In CCD technology,
charges have to be transferred from one pixel to the next until they can
finally be read out at an array boundary. This sequential charge transfer
also gave CCDs their name. Images generated with CCDs can be of higher
quality than those generated using CMOS sensors, since they generate less
noise. However, interfacing is more complex. As a result, CMOS sensors
are appropriate for applications requiring low or medium costs and low or
medium image quality. CCD sensors are more adequate for high quality,
expensive image sensors (such as those found in video cameras and optical
telescopes, for example).

■ Bio-metrical sensors: Demands for higher security standards as well as
the need to protect mobile and removable equipment have led to an
increased interest in authentication. Due to the limitations of password based
security (e.g. stolen and lost passwords), smartcards, bio-metrical sensors
and bio-metrical authentication receive significant attention. Bio-metrical
authentication tries to identify whether or not a certain person is actually
the person she or he claims to be. Methods for bio-metrical authentication
include iris scans, finger print sensors and face recognition. Finger print
sensors are typically fabricated using the same CMOS technology [Weste
et al., 2000] which is used for manufacturing integrated circuits. Possible
applications include notebooks which grant access only if the user’s finger
print is recognized [IBM Inc., 2002]. CCD and CMOS image sensors
described above are used for face recognition. False accepts as well as false
rejects are an inherent problem of bio-metrical authentication. In contrast
to password based authentication, exact matches are not possible.

■ Artificial eyes: Artificial eye projects have received significant attention.
While some projects attempt to actually affect the eye, others try to provide
vision in an indirect way. The Dobelle Institute is experimenting with a
setup in which a little camera is attached to glasses. This camera is
connected to a computer translating these patterns into electrical pulses. These
pulses are then sent directly to the brain, using a direct contact through an
electrode. Currently (2003), the resolution is in the order of 128 by 128
pixels, enabling blind persons to drive a car in controlled areas [The Dobelle
Institute, 2003].

■ Other sensors: Other common sensors include: pressure sensors,
proximity sensors, engine control sensors, Hall effect sensors, and many more.

3.2.2 Sample-and-hold circuits 

All known digital computers work in the discrete time domain. This means
they can process discrete sequences of values. Hence, values in the
continuous domain have to be converted to the discrete domain. This is the
purpose of sample-and-hold circuits. Fig. 3.4 (left) shows a simple
sample-and-hold circuit.

Figure 3.4. Sample-and-hold circuit

In essence, the circuit consists of a clocked transistor and a capacitor. The
transistor operates like a switch. Each time the switch is closed by the clock
signal, the capacitor is charged so that its voltage is practically the same as
the incoming voltage $V_e$. After opening the switch again, this voltage will
remain essentially unchanged until the switch is closed again. Each of the
values stored on the capacitor can be considered as an element of a discrete
sequence of values $V_x$, generated from a continuous sequence $V_e$ (see
fig. 3.4, right).

An ideal sample-and-hold circuit would be able to change the voltage at the
capacitor in an arbitrarily short amount of time. This way, the input voltage at a
particular instant in time could be transferred to the capacitor and each element
in the discrete sequence would correspond to the input voltage at a particular
point in time. In practice, however, the transistor has to be kept closed for a
short time window in order to really charge or discharge the capacitor. The
voltage stored on the capacitor will then correspond to a voltage averaged over
that short time window.
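
The difference between ideal and real sampling can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_and_hold`, its parameters and the test signal `ramp` are hypothetical, and averaging the input over the window is a crude stand-in for the actual charging behavior of the capacitor.

```python
def sample_and_hold(v_e, period, window, t_end, steps=100):
    """Return (ideal, averaged) sample sequences of the continuous signal v_e(t).

    ideal[k]    is v_e at the instant the switch closes for sample k.
    averaged[k] is the mean of v_e over the short window during which the
                switch stays closed (assumed model of the real circuit).
    """
    ideal, averaged = [], []
    k = 0
    while k * period + window <= t_end:
        t0 = k * period
        ideal.append(v_e(t0))
        # Approximate the average over [t0, t0 + window] with a midpoint rule.
        dt = window / steps
        mean = sum(v_e(t0 + (i + 0.5) * dt) for i in range(steps)) / steps
        averaged.append(mean)
        k += 1
    return ideal, averaged


def ramp(t):
    """Hypothetical input voltage rising by 2 V per time unit."""
    return 2.0 * t


ideal, averaged = sample_and_hold(ramp, period=1.0, window=0.1, t_end=3.5)
```

For the ramp input, each averaged sample lies slightly above the corresponding ideal sample, because the input keeps rising while the switch is closed.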

3.2.3 A/D-converters 

Since we are restricting ourselves to digital computers, we will also have to 
work with discrete values representing our input signals. The conversion from 
analog to digital values is done by analog-to-digital (A/D) converters. There is 
a large range of A/D converters with varying speed/precision characteristics. 
In this book, we will present two extreme cases: 

■ Flash A/D converter: This type of A/D converter uses a large number of
comparators. Each comparator has two inputs, denoted as + and -. If the
voltage at input + exceeds that at input -, the output corresponds to a logical
’1’ and it corresponds to a logical ’0’ otherwise*.

In the A/D-converter, all - inputs are connected to a voltage divider. Now, if
input voltage $V_x$ exceeds constant voltage $V_{ref}$, the comparator at the
top of fig. 3.5 will generate a ’1’. The encoder at the output of the
comparators will try to identify the most significant ’1’ and will encode the
case of $V_x > V_{ref}$ as the largest output value.

*In practice, the case of equal voltages is not relevant, as the actual behavior for very small differences
between the voltages at the two inputs depends on many factors (like temperatures, manufacturing processes
etc.) anyway.

[Fig. 3.5: flash A/D converter; the encoder generates the digital outputs]

Now if input voltage $V_x$ is less than $V_{ref}$, but still larger than
$\frac{3}{4} V_{ref}$, the comparator at the top of fig. 3.5 will generate a
’0’, while the next comparator will still signal a ’1’. The encoder will
encode this as the second-largest value.

Similar arguments hold for cases
$\frac{2}{4} V_{ref} < V_x < \frac{3}{4} V_{ref}$,
$\frac{1}{4} V_{ref} < V_x < \frac{2}{4} V_{ref}$,
and $0 < V_x < \frac{1}{4} V_{ref}$, which will be encoded as the
third-largest, fourth-largest and smallest value, respectively.

The circuit can convert positive analog input voltages into digital values.
Converting both positive and negative voltages requires some extensions.

The key advantage of the circuit is its speed. It does not need any clock.
The delay between the input and the output is very small and the circuit can
be used easily, for example, for high-speed video applications. The
disadvantage is its hardware complexity: we need $n - 1$ comparators in order
to distinguish between $n$ values. Imagine using this circuit in generating
digital audio signals for CD recorders. We would need $2^{16} - 1$ comparators!

■ Successive approximation: Distinguishing between a large number of
digital values is possible with A/D converters using successive approximation.
The circuit is shown in fig. 3.6.

Figure 3.6. Circuit using successive approximation

The key idea of this circuit is to use binary search. Initially, the most
significant output bit of the successive approximation register is set to ’1’, all
other bits are set to ’0’. This digital value is then converted to an analog
value, corresponding to 0.5 x the maximum input voltage*. If $V_x$ exceeds
the generated analog value, the most significant bit is kept at ’1’, otherwise
it is reset to ’0’.

This process is repeated with the next bit. It will remain set to ’1’ if the
input value is either within the second or the fourth quarter of the input
value range. The same procedure is repeated for all the other bits.

The key advantage of the successive approximation technique is its
hardware efficiency. In order to distinguish between $n$ digital values, we need
$\log_2(n)$ bits in the successive approximation register and the D/A converter.
The disadvantage is its speed, since it needs $O(\log_2(n))$ steps. These
converters can therefore be used for applications where high precision
conversions at moderate speeds are required. Examples include audio
applications.
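
The two conversion schemes described above can be contrasted in a short behavioral sketch. It is not a circuit model: the function names `flash_adc` and `sar_adc`, the parameter names and the idealized D/A conversion are assumptions made for illustration only.

```python
def flash_adc(v_x, v_ref, n_comparators=4):
    """Flash conversion: compare v_x against all voltage divider taps at once."""
    # Tap k sits at k/n_comparators * v_ref, for k = 1 .. n_comparators.
    thresholds = [v_ref * k / n_comparators for k in range(1, n_comparators + 1)]
    # Each comparator outputs 1 if its + input (v_x) exceeds its - input (tap).
    outputs = [1 if v_x > th else 0 for th in thresholds]
    # The encoder reports the position of the most significant 1 (0 if none).
    code = 0
    for k, bit in enumerate(outputs, start=1):
        if bit:
            code = k
    return code


def sar_adc(v_x, v_max, bits):
    """Successive approximation: binary search, one bit per step from the MSB down."""
    code = 0
    for b in reversed(range(bits)):
        trial = code | (1 << b)               # tentatively set this bit to 1
        v_dac = v_max * trial / (1 << bits)   # idealized D/A conversion
        if v_x > v_dac:                       # keep the bit only if v_x exceeds it
            code = trial
    return code
```

With four comparators, `flash_adc` distinguishes five values in a single step, whereas `sar_adc` needs one loop iteration per bit. For 16-bit audio this is the trade-off named in the text: $2^{16} - 1 = 65535$ comparators against 16 sequential steps.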



There are several other types of A/D-converters. Techniques for automatically
selecting the most appropriate converter exist [Vogels and Gielen, 2003].

3.3 Communication 

Information must be available before it can be processed in an embedded
system. Information can be communicated through various channels. Channels
are abstract entities characterized by the essential properties of
communication, like maximum information transfer capacity and noise parameters.
The probability of communication errors can be computed using communication
theory techniques. The physical entities enabling communication are called
communication media. Important media classes include: wireless media (radio
frequency media, infrared), optical media (fibers), and wires.



*Fortunately, the conversion from digital to analog values (D/A-conversion) can be implemented very
efficiently and can be very fast (see page 121).

There is a huge variety of communication requirements between the various
classes of embedded systems. In general, connecting the different embedded
hardware components is far from trivial. Some common requirements can be
identified.

3.3.1 Requirements 

The following list contains some of the requirements that have to be met: 

■ Real-time behavior: This requirement has far-reaching consequences on 
the design of the communication system. Some of the low-cost solutions 
such as Ethernet fail to meet this requirement. 

■ Efficiency: Connecting different hardware components can be quite expen- 
sive. For example, point to point connections in large buildings are almost 
impossible. Also, it has been found that separate wires between control 
units and external devices in cars significantly add to the cost and the weight 
of the car. With separate wires, it is also very difficult to add components. 
The need of providing cost efficient designs also affects the way in which 
power is made available to external devices. There is frequently the need to 
use a central power supply in order to reduce the cost. 

■ Appropriate bandwidth and communication delay: Bandwidth require- 
ments of embedded systems may vary. It is important to provide sufficient 
bandwidth without making the communication system too expensive. 

■ Support for event-driven communication: Polling-based systems pro- 
vide a very predictable real-time behavior. However, their communication 
delay may be too large and there should be mechanisms for fast, event- 
oriented communication. For example, emergency situations should be 
communicated immediately and should not remain unnoticed until some 
central controller polls for messages. 

■ Robustness: Embedded systems may be used at extreme temperatures,
close to major sources of electromagnetic radiation etc. Car engines, for
example, can be exposed to temperatures less than -20 and up to +180
degrees Celsius (-4 to 356 degrees Fahrenheit). Voltage levels and clock
frequencies could be affected due to this large variation in temperatures. Still,
reliable communication must be maintained.

■ Fault tolerance: Despite all the efforts for robustness, faults may occur.
Embedded systems should be operational even after such faults. Restarts,
like the ones found in personal computers, cannot be accepted. This means
that retries may be required after attempts to communicate failed. A conflict
exists with the first requirement: If we allow retries, then it is difficult to
meet strict real-time requirements.

■ Maintainability, diagnosability: Obviously, it should be possible to re- 
pair embedded systems within reasonable time frames. 

■ Privacy: Ensuring privacy of confidential information may require the use 
of encryption. 

These communication requirements are a direct consequence of the general
characteristics of embedded systems mentioned in chapter 1. Due to the
conflicts between some of the requirements, compromises have to be made. For
example, there may be different communication modes: one high-bandwidth 
mode guaranteeing real-time behavior but no fault tolerance (this mode is ap- 
propriate for multimedia streams) and a second fault-tolerant, low-bandwidth 
mode for short messages that must not be dropped. 

3.3.2 Electrical robustness 

There are some basic techniques for electrical robustness. Digital
communication within chips normally uses so-called single-ended signaling. For
single-ended signaling, signals are propagated on a single wire (see fig. 3.7).




Such signals are represented by voltages with respect to a common ground
(less frequently by currents). A single ground wire is sufficient for a number
of single-ended signals. Single-ended signaling is highly susceptible to
external noise. If external noise (originating from, for example, motors being
switched on) affects the voltage, messages can easily be corrupted. Also, it
is difficult to establish high-quality common ground signals between a large
number of communicating systems, due to the resistance (and inductance) on
the ground wires. This is different for differential signaling. For differential
signaling, each signal needs two wires (see fig. 3.8).

Using differential signaling, binary values are encoded as follows: If the
voltage on the first wire with respect to the second is positive, then this is
decoded as ’1’, otherwise values are decoded as ’0’. The two wires will
typically be twisted to form so-called twisted pairs. There will be local
ground signals, but a non-zero voltage between the local ground signals does
not hurt. Advantages of differential signaling include:

■ Noise is added to the two wires in essentially the same way. The comparator 
therefore removes almost all the noise. 

■ The logic value depends just on the polarity of the voltage between the 
two wires. The magnitude of the voltage can be affected by reflections or 
because of the resistance of the wires; this has no effect on the decoded 
value. 

■ Signals do not generate any currents on the ground wires. Hence, the qual- 
ity of the ground wires becomes less important. 

■ No common ground wire is required. Hence, there is no need to establish a 
high quality ground wiring between a large number of communicating part- 
ners (this is one of the reasons for using differential signaling for Ethernet). 

■ As a consequence of the properties mentioned so far, differential signaling 
allows a larger throughput than single-ended signaling. 

However, differential signaling requires two wires for every signal and it also 
requires negative voltages (unless it is based on complementary logic signals 
using voltages for single-ended signals). 

Differential signaling is used, for example, in standard Ethernet-based net- 
works. 
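
The noise argument for differential signaling can be illustrated with a small numerical sketch. The voltage levels (0 V and 1 V for single-ended, ±0.5 V for differential), the decision threshold and all function names are assumptions chosen for this example, not values taken from any standard.

```python
def decode_single_ended(v_signal, threshold=0.5):
    """Decode against a common ground: 1 if the wire voltage exceeds the threshold."""
    return 1 if v_signal > threshold else 0


def decode_differential(v_first, v_second):
    """Decode by polarity only: 1 if the first wire is positive w.r.t. the second."""
    return 1 if v_first - v_second > 0 else 0


def transmit(bits, noise):
    """Send bits over both schemes, adding the same noise value to every wire."""
    single, differential = [], []
    for bit, n in zip(bits, noise):
        # Single-ended: 1.0 V for '1', 0.0 V for '0' (assumed levels), plus noise.
        single.append(decode_single_ended((1.0 if bit else 0.0) + n))
        # Differential: the wires carry opposite levels; noise hits both alike.
        v_first = (0.5 if bit else -0.5) + n
        v_second = (-0.5 if bit else 0.5) + n
        differential.append(decode_differential(v_first, v_second))
    return single, differential


sent = [1, 0, 1, 0]
single, differential = transmit(sent, noise=[-0.8, 0.9, 0.0, 0.2])
```

In this run the first two single-ended bits are decoded incorrectly, while the differential receiver recovers all four bits, since noise that is common to both wires cancels in the difference.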

3.3.3 Guaranteeing real-time behavior 

Most computer networks are based on Ethernet standards. For 10 Mbit/s and
100 Mbit/s versions of Ethernet, there can be collisions between various
communication partners. This means: several partners are trying to communicate
at about the same time and the signals on the wires are corrupted. Whenever
this occurs, the partners have to stop communications, wait for some time,
and then retry. The waiting time is chosen at random, so that it is not very
likely that the next attempt to communicate results in another collision. This
method is called carrier-sense multiple access/collision detect (CSMA/CD).
For CSMA/CD, communication time can get huge, since conflicts can repeat a
large number of times, even though this is not very likely. Hence, CSMA/CD
cannot be used when real-time constraints have to be met.
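
Why random retries rule out hard guarantees can be seen in a strongly simplified simulation. The model below is an assumption made for illustration (uniform random waiting times in a fixed number of slots, all partners contending in every round); it is not the backoff algorithm of any Ethernet standard, and the name `rounds_until_success` is hypothetical.

```python
import random


def rounds_until_success(n_partners, n_slots, rng):
    """Count contention rounds until one partner transmits without a collision.

    In every round each partner draws a random waiting time in slots. The
    round succeeds only if a single partner holds the shortest waiting time;
    otherwise the earliest transmissions collide and everybody retries.
    """
    if n_partners > 1 and n_slots < 2:
        raise ValueError("with one slot, several partners would collide forever")
    rounds = 0
    while True:
        rounds += 1
        waits = [rng.randrange(n_slots) for _ in range(n_partners)]
        if waits.count(min(waits)) == 1:
            return rounds


rng = random.Random(1)  # fixed seed, only to make the run repeatable
samples = [rounds_until_success(n_partners=4, n_slots=4, rng=rng)
           for _ in range(1000)]
```

Most entries of `samples` are small, but the loop has no upper bound on the number of rounds. This missing bound, rather than the average delay, is what conflicts with real-time constraints.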

This problem can be solved with CSMA/CA (carrier-sense multiple access/ 
collision avoidance). As the name indicates, collisions are completely avoided, 
rather than just detected. For CSMA/CA, priorities are assigned to all part- 
ners. Communication media are allocated to communication partners during 
arbitration phases, which follow communication phases. During arbitration 
phases, partners wanting to communicate indicate this on the media. Partners 
finding such indications of higher priority have to immediately remove their 
indication. 

Provided that there is an upper bound on the time between arbitration phases, 
CSMA/CA guarantees a predictable real-time behavior for the partner having 
the highest priority. For other partners, real-time behavior can be guaranteed if 
the higher priority partners do not continuously request access to the media. 
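
Priority-based arbitration as described for CSMA/CA can be sketched as follows. The partner names and the convention that a smaller number means a higher priority are assumptions for this example; the sketch also assumes that all priorities are distinct and ignores how the indications are physically placed on the medium.

```python
def arbitrate(requests):
    """Return the partner winning one arbitration phase, or None if nobody asks.

    requests maps a partner name to its priority; a smaller number means a
    higher priority (assumed convention). Partners seeing an indication of
    higher priority withdraw, so the highest priority request remains.
    """
    if not requests:
        return None
    return min(requests, key=requests.get)


def schedule(pending):
    """Serve all pending requests, one communication phase per arbitration phase."""
    pending = dict(pending)  # work on a copy
    order = []
    while pending:
        winner = arbitrate(pending)
        order.append(winner)
        del pending[winner]
    return order


order = schedule({"brake": 0, "radio": 5, "window": 3})
```

The highest priority partner is always served in the next communication phase. A lower priority partner is served only once no higher priority request is pending, which matches the condition stated above for guaranteeing its real-time behavior.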

Note that high-speed versions of Ethernet (1 Gbit/s) also avoid collisions. 

3.3.4 Examples 

■ Sensor/actuator busses: Sensor/actuator busses provide communication 
between simple devices such as switches or lamps and the processing equip- 
ment. There may be many such devices and the cost of the wiring needs 
special attention for such busses. 

■ Field busses: Field busses are similar to sensor/actuator busses. In general, 
they are supposed to support larger data rates than sensor/actuator busses. 
Examples of field busses include the following: 

- Controller Area Network (CAN): This bus was developed in 1981 by 
Bosch and Intel for connecting controllers and peripherals. It is popular 
in the automotive industry, since it allows the replacement of a large 
amount of wires by a single bus. Due to the size of the automotive 
market, CAN components are relatively cheap and are therefore also 
used in other areas such as smart homes and fabrication equipment. 
CAN has the following properties: 

* differential signaling with twisted pairs, 

* arbitration using CSMA/CA, 

* throughput between 10 kbit/s and 1 Mbit/s,

* low and high-priority signals,

* maximum latency of 134 µs for high priority signals,

* coding of signals similar to that of serial (RS-232) lines of PCs, 
with modifications for differential signaling. 

- The Time-Triggered Protocol (TTP) [Kopetz and Grünsteidl, 1994]
for fault-tolerant safety systems like airbags in cars.

- FlexRay [FlexRay Consortium, 2002] is a TDMA (Time Division
Multiple Access) protocol which has been developed by the FlexRay
consortium (BMW, DaimlerChrysler, General Motors, Ford, Bosch,
Motorola and Philips Semiconductors). FlexRay is a combination of a
variant of the TTP and the byteflight [Byteflight Consortium, 2003]
protocol.

- MAP: MAP is a bus designed for car factories. 

- EIB: The European Installation Bus (EIB) is a bus designed for smart 
homes. 

■ Wireless communication: Wireless communication is becoming more pop- 
ular, but communication bandwidth is becoming a scarce resource. As a 
result, the frequencies reserved for third generation UMTS mobile phones 
have been sold at extremely high prices (at about 500 Euros or dollars per 
person living in Germany). 

Bluetooth is a standard for connecting devices such as mobile phones and 
their headsets. 

The wireless version of Ethernet is standardized as IEEE standard 802.11. 
It is being used in local area networks (LANs). 

DECT is a standard used for wireless phones in Europe. 

HomeRF [palowireless, 2003] is a standard for synchronous wireless trans- 
mission of speech and multimedia data. 

3.4 Processing Units 
3.4.1 Overview 

For information processing, we will consider ASICs (application-specific inte- 
grated circuits) using hardwired multiplexed designs, reconfigurable logic, and 
processors. These three technologies are quite different, for example, as far as 
their energy efficiency is concerned. Fig. 3.9 (approximating information pro- 
vided by H. De Man and Th. Claasen [De Man, 2002]) shows the number of 
operations per Watt that can be achieved with a certain hardware technology. 

Figure 3.9. Hardware efficiency 

Obviously, the number of operations per Watt is increasing as technology ad- 
vances to smaller and smaller feature sizes of integrated circuits. However, for 
any given technology, the number of operations per Watt is largest for applica- 
tion specific hardwired circuits. For reconfigurable logic (see page 115), this 
value is about one order of magnitude lower. For programmable processors, it 
is about two orders of magnitude lower. On the other hand, processors offer the 
largest amount of flexibility, resulting from the flexibility of software. There 
is also some flexibility for reconfigurable logic, but it is limited to the size of 
applications that can be mapped to such logic. For hardwired designs, there 
is no flexibility. This observation also applies for processors: For processors 
optimized for the application domain, such as processors optimized for digital 
signal processing (DSP processors), power-efficiency values approach those of 
reconfigurable logic. For general standard microprocessors, the values for this 
figure of merit are the worst. 

The energy E for a certain application is closely related to the power P required 
per operation, since 



$$E = \int P \, dt$$



Hence, reducing the power consumption also decreases the energy consump- 
tion, provided that the integral is taken over the same period of time. In some 
cases, however, a slightly increased power consumption might lead to a drastic 
reduction in the execution time and, hence, might lead to a minimized energy 
consumption. So, in some cases a minimized power consumption also corre- 
sponds to a minimized energy consumption, but this is not necessarily always 
true. 
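
The relation between power and energy can be made concrete with a small calculation. The sketch below assumes constant power during execution, so that the integral reduces to E = P * t. The two operating points are invented for illustration and do not describe a real device.

```python
def energy_joules(power_watts, time_seconds):
    """Energy for constant power: the integral of P dt reduces to P * t."""
    return power_watts * time_seconds


# Hypothetical operating points for the same task (illustrative numbers only).
slow = energy_joules(power_watts=1.0, time_seconds=10.0)   # lower power, long run
fast = energy_joules(power_watts=1.2, time_seconds=5.0)    # 20% more power, half the time

print(f"slow: {slow:.1f} J, fast: {fast:.1f} J")
```

With these assumed numbers, the operating point with the higher power consumption needs less energy, because the execution time shrinks by more than the power grows.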

Minimization of power and energy consumption are both important. Power 
consumption has an effect on the size of the power supply, the design of the 
voltage regulators, the dimensioning of the interconnect, and short term cool- 
ing. Minimizing the energy consumption is required especially for mobile 
applications, since battery technology is only slowly improving [SEMATECH, 
2003], and since the cost of energy may be quite high. Also, a reduced en- 
ergy consumption decreases cooling requirements and improves the reliability 
(since the lifetime of electronic circuits decreases for high temperatures). 

Fig. 3.9 reflects the efficiency/flexibility conflict of currently available hard- 
ware technologies: if we want to aim at very power- and energy-efficient de- 
signs, we should not use flexible designs based on processors or re-programmable 
logic, and if we go for excellent flexibility, we cannot be power-efficient. 
We will consider ASICs first. 

3.4.2 Application-Specific Circuits (ASICs) 

For high-performance applications and for large markets, application-specific 
integrated circuits (ASICs) can be designed. However, the cost of designing 
and manufacturing such chips is quite high. For example, the mask which is 
used for transferring geometrical patterns onto the chip can cost about 
10^ Euros or dollars. Therefore, ASICs are appropriate only if either maximum 
energy efficiency is needed and the market accepts the costs, or a large 
number of such systems can be sold. 

3.4.3 Processors 

The key advantage of processors is their flexibility. With processors, the overall 
behavior of embedded systems can be changed by just changing the software 
running on those processors. Changes of the behavior may be required in order 
to correct design errors, to update the system to a new or changed standard or 
in order to add features to the previous system. Because of this, processors 
have become very popular. This popularity has also been stressed in the public 
press: 

At the chip level, embedded chips include micro-controllers and microproces- 
sors. Micro-controllers are the true workhorses of the embedded family. They 
are the original ’embedded chips’ and include those first employed as con- 
trollers in elevators and thermostats [Ryan, 1995]. 

Embedded processors have to be efficient and they do not need to be instruc- 
tion set compatible with commonly used personal computers (PCs). Therefore, 
their architectures may be different from those of processors found in PCs. 
Efficiency has a number of different aspects (see page 2): 

■ Energy-efficiency: Architectures have to be optimized for their energy 
efficiency and we have to make sure that we are not losing efficiency in 
the software generation process. For example, compilers generating 50% 
overhead in terms of the number of cycles will take us further away from the 
efficiency of ASICs, possibly by even more than 50%, if the supply voltage 
and the clock frequency have to be increased in order to meet deadlines. 

There is a large number of techniques available that can make processors 
energy efficient and energy efficiency should be considered at various levels 
of abstraction, from the design of the instruction set down to the design of 
the chip manufacturing process [Burd and Brodersen, 2003]. Gated clock- 
ing is an example of such a technique. With gated clocking, parts of the 
processor are decoupled from the clock during idle periods. For example, 
no clock is applied to the multiplier if no multiplications are executed. Also, 
there are attempts to get rid of the clock for major parts of the processor 
altogether. There are two contrasting approaches: globally synchronous, 
locally asynchronous processors and globally asynchronous, locally syn- 
chronous processors (GALS) [Iyer and Marculescu, 2002]. 

Two techniques can be applied at a rather high level of abstraction: 

- Dynamic power management (DPM): With this approach, processors 
have several power saving states in addition to the standard operating 
state. Each power saving state has a different power consumption and 
a different time for transitions into the operating state. Fig. 3.10 shows 
the three states for the StrongARM SA-1100 processor. 

[State diagram; surviving label: 400 mW] 

Figure 3.10. Dynamic power management states of the StrongARM SA-1100 processor 

The processor is fully operational in the run state. In the idle state, it 
is just monitoring the interrupt inputs. In the sleep state, all on-chip 
activity is shut down. Note the large difference in the power consump- 
tion between the sleep state and the other states, and note also the large 
delay for transitions from the sleep to the run state. 

- Dynamic voltage scaling (DVS): This approach exploits the fact that 
the energy consumption of CMOS processors increases quadratically 
with the supply voltage $V_{dd}$. The power consumption P of CMOS cir- 
cuits is given by [Chandrakasan et al., 1992]: 

$$P = \alpha \, C_L \, V_{dd}^2 \, f \qquad (3.1)$$

where $\alpha$ is the switching activity, $C_L$ is the load capacitance, $V_{dd}$ is 
the supply voltage and $f$ is the clock frequency. The delay of CMOS 
circuits can be approximated as [Chandrakasan et al., 1992], [Chan- 
drakasan et al., 1995]: 

$$\tau = k \, C_L \, \frac{V_{dd}}{(V_{dd} - V_t)^2} \qquad (3.2)$$


where k is a constant, and $V_t$ is the threshold voltage. $V_t$ has an impact 
on the transistor input voltage required to switch the transistor on. For 
example, for a maximum supply voltage of $V_{dd,max}$ = 3.3 volts, $V_t$ may be 
in the order of 0.8 volts. Consequently, the maximum clock frequency 
is a function of the supply voltage. However, decreasing the supply 
voltage reduces the power quadratically, while the run-time of algo- 
rithms is only linearly increased (ignoring the effects of the memory 
system). This can be exploited in a technique called dynamic voltage 
scaling (DVS). For example, the Crusoe™ processor by Transmeta 
provides 32 voltage levels between 1.1 and 1.6 volts, and the clock can 
be varied between 200 MHz and 700 MHz in increments of 33 MHz. 
Transitions from one voltage/frequency pair to the next take about 
20 ms. Design issues for DVS-capable processors are described in a 
paper by Burd and Brodersen [Burd and Brodersen, 2000]. According 
to the same paper, potential power savings will exist even for future 
technologies with a decreased maximum $V_{dd}$, since the threshold volt- 
ages will also be decreased (unfortunately, this will lead to increased 
leakage currents, increasing the standby power consumption). Two dif- 
ferent speed/voltage pairs are provided with the Intel® SpeedStep™ 
technology for the Mobile Pentium® III. 
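
The trade-off behind dynamic voltage scaling can be explored numerically with equations (3.1) and (3.2). The following sketch works in normalized units: the switching activity, the load capacitance and the constant k are all set to 1, the threshold voltage is taken as 0.8 volts, and a reference supply voltage of 3.3 volts is assumed. The function names are hypothetical, and the clock is assumed to run at the maximum frequency 1/τ permitted by the delay.

```python
def delay(vdd, vt=0.8, k=1.0, c_load=1.0):
    """Gate delay per eq. (3.2): tau = k * C_L * Vdd / (Vdd - Vt)^2."""
    return k * c_load * vdd / (vdd - vt) ** 2


def power(vdd, freq, alpha=1.0, c_load=1.0):
    """Dynamic power per eq. (3.1): P = alpha * C_L * Vdd^2 * f."""
    return alpha * c_load * vdd ** 2 * freq


def energy_per_task(vdd, cycles=1.0):
    """Energy for a fixed number of cycles when clocked at the maximum frequency 1/tau."""
    freq = 1.0 / delay(vdd)
    runtime = cycles / freq
    return power(vdd, freq) * runtime


V_REF = 3.3   # assumed reference supply voltage in volts
for vdd in (3.3, 2.5, 1.8):
    rel_time = delay(vdd) / delay(V_REF)
    rel_energy = energy_per_task(vdd) / energy_per_task(V_REF)
    print(f"Vdd={vdd:.1f} V  run-time x{rel_time:.2f}  energy x{rel_energy:.2f}")
```

In this simplified model the frequency cancels out of the energy per task, which therefore depends only on the square of the supply voltage, while the run-time grows as the voltage approaches the threshold voltage.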



■ Code-size efficiency: Minimizing the code size is very important for em- 
bedded systems, since hard disc drives are typically not available and since 
the capacity of memory is typically also very limited. This is even more 
pronounced for systems on a chip (SOCs). For SOCs, the memory and pro- 
cessors are implemented on the same chip. In this particular case, memory 
is called embedded memory. Embedded memory may be more expensive 
to fabricate than separate memory chips, since the fabrication processes for 
memories and processors have to be compatible. Nevertheless, a large per- 
centage of the total chip area may be consumed by the memory. There are 
several techniques for improving the code-size efficiency: 

- CISC machines: Standard RISC processors have been designed for 
speed, not for code-size efficiency. Earlier Complex Instruction Set 
Processors (CISC machines) were actually designed for code-size ef- 
ficiency, since they had to be connected to slow memories and caches 
were not frequently used. Therefore, “old-fashioned” CISC processors 
are finding applications in embedded systems. Motorola’s ColdFire 
processors, which are based on the Motorola 68000 family of CISC 
processors, are an example of this. 

- Compression techniques: In order to reduce the amount of silicon 
needed for storing instructions as well as the energy needed for 
fetching them, instructions are frequently stored in the memory in 
compressed form. Due to the reduced bandwidth requirements, fetching 
can also be faster. A (hopefully small and fast) decoder is placed 
between the processor and the (instruction) memory in order to 
generate the original instructions on the fly (see fig. 3.11, right). 
Instead of using a potentially large memory of uncompressed 
instructions, we are storing the instructions in a compressed format. 




[Block diagram; surviving labels: decoder, instruction, address] 

Figure 3.11. Decompression of compressed instructions 

The goals of compression can be summarized as follows: 

* We would like to save ROM and RAM areas, since these may be 
more expensive than the processors themselves. 

* We would like to use some encoding technique for instructions and 
possibly also for data with the following properties: 

• There should be little or no run-time penalty for these tech- 
niques. 

• Decoding should work from a limited context (it is, for exam- 
ple, impossible to read the entire program to find the destina- 
tion of a branch instruction). 

• Word-sizes of the memory, of instructions and addresses have 
to be taken into account. 

• Branch instructions branching to arbitrary destination addresses 
have to be supported. 

• Fast encoding is only required if writable data is encoded. Oth- 
erwise, fast decoding is sufficient. 

There are several variations of this scheme: 

* For some processors, there is a second instruction set. This sec- 
ond instruction set has a narrower instruction format. An example 
of this is the ARM processor family. The ARM instruction set is a 
32 bit instruction set. The ARM instruction set includes predicated 
execution. This means an instruction is executed if and only if a 
certain condition is met (see page 113). This condition is encoded 
in the first four bits of the instruction format. Most ARM pro- 
cessors also provide a second instruction set, with 16 bit wide in- 
structions, called THUMB instructions. THUMB instructions are 
shorter, since they do not support predication, use shorter and fewer 
register fields and use shorter immediate fields (see fig. 3.12). 




Figure 3.12. Re-encoding THUMB into ARM instructions 

THUMB instructions are dynamically converted into ARM instruc- 
tions while programs are running. THUMB instructions can use 
only half the registers in arithmetic instructions. Therefore, register 
fields of THUMB instructions are concatenated with a ’0’-bit^. In 
the THUMB instruction set, source and destination registers are 
identical and the length of constants that can be used is reduced 
by 4 bits. During decoding, pipelining is used to keep the run-time 
penalty low. 



^Using VHDL-notation (see page 59), concatenation is denoted by an &-sign and constants are enclosed in 
quotes in fig. 3.12. 

Similar techniques also exist for other processors. The disadvan- 
tage of this approach is that the tools (compilers, assemblers, de- 
buggers etc.) have to be extended to support a second instruction 
set. Therefore, this approach can be quite expensive in terms of 
software development cost. 

* A second approach is the use of dictionaries. With this approach, 
each instruction pattern is stored only once. For each value of the 
program counter, a look-up table then provides a pointer to the cor- 
responding instruction in the instruction table, the dictionary (see 
fig. 3.13). 




Figure 3.13. Dictionary approach for instruction compression 

This approach relies on the idea that only very few different in- 
struction patterns are used. Therefore, only few entries are re- 
quired for the instruction table. Correspondingly, the bit width 
of the pointers can be quite small. Many variations of this scheme 
exist. Some are called two-level control store [Dasgupta, 1979], 
nanoprogramming [Stritter and Gunter, 1979], or procedure ex- 
lining [Vahid, 1995]. 

A comprehensive survey of known compression techniques is avail- 
able on the Internet [van de Wiel, 2002]. 
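
The dictionary approach described above can be sketched in a few lines. The example below is only an illustration under simple assumptions: instructions are 32-bit words, the look-up table holds one pointer per program address, and the names `compress`, `fetch` and `pointer_bits` as well as the instruction words themselves are made up for this sketch.

```python
def compress(program):
    """Build an instruction table (dictionary) and a per-address pointer list."""
    table = []          # each distinct instruction pattern is stored once
    index_of = {}       # pattern -> position in the table
    pointers = []       # one small index per program address
    for word in program:
        if word not in index_of:
            index_of[word] = len(table)
            table.append(word)
        pointers.append(index_of[word])
    return table, pointers


def fetch(table, pointers, pc):
    """Decoder step: look up the pointer for pc, then the instruction it names."""
    return table[pointers[pc]]


def pointer_bits(table):
    """Bits needed per pointer to address every dictionary entry."""
    return max(1, (len(table) - 1).bit_length())


# Hypothetical 32-bit instruction words; the values are arbitrary.
program = [0xE3A00000, 0xE2800001, 0xE2800001, 0xE3A00000, 0xE2800001, 0xE12FFF1E]
table, pointers = compress(program)

original_bits = 32 * len(program)
compressed_bits = 32 * len(table) + pointer_bits(table) * len(pointers)
print(len(table), pointers, original_bits, compressed_bits)
```

The saving grows with the number of repeated instruction patterns; for programs with few repetitions, the table plus the pointers can be larger than the original code.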



■ Run-time efficiency: In order to meet time constraints without having to 
use high clock frequencies, architectures can be customized to certain ap- 
plication domains, such as digital signal processing (DSP). One can even go 
one step further and design application specific instruction set processors 
(ASIPs). As an example of domain-specific processors, we will consider 
processors for DSP. In digital signal processing, digital filtering is a very 
frequent operation. Equation 3.3 describes a digital filter generating an 
output sequence $y = (y_0, y_1, ...)$ from an input sequence $x = (x_0, x_1, ...)$. 

$$y_i = \sum_{j=0}^{n-1} x_{i-j} \cdot a_j \qquad (3.3)$$

A certain output element $y_i$ corresponds to a weighted average over the last 
n sequence elements of x and can be computed iteratively using equations 
3.4 to 3.6. 

$$y_{i,j} = y_{i,j-1} + x_{i-j} \cdot a_j \qquad (3.4)$$

where: 

$$y_{i,-1} = 0 \qquad (3.5)$$

and: 

$$y_i = y_{i,n-1} \qquad (3.6)$$
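
The equivalence of the closed form and the iterative form of the filter can be checked with a short sketch. The function names and the sample values below are chosen for illustration only; the index i is assumed to be at least n-1 so that all accessed elements of x exist.

```python
def fir_direct(x, a, i):
    """Eq. (3.3): y_i is the sum over j = 0..n-1 of x[i-j] * a[j]."""
    n = len(a)
    return sum(x[i - j] * a[j] for j in range(n))


def fir_iterative(x, a, i):
    """Eqs. (3.4)-(3.6): accumulate partial sums y_{i,j}, starting from y_{i,-1} = 0."""
    partial = 0                                  # y_{i,-1}
    for j in range(len(a)):
        partial = partial + x[i - j] * a[j]      # y_{i,j} = y_{i,j-1} + x_{i-j} * a_j
    return partial                               # y_i = y_{i,n-1}


x = [1, 2, 3, 4, 5]      # input samples (illustrative)
a = [1, 0, 2]            # filter coefficients (illustrative), n = 3
print([fir_iterative(x, a, i) for i in range(len(a) - 1, len(x))])
```

Each pass through the loop body performs one multiplication and one addition.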



DSPs are designed such that each iteration can be encoded as a single in- 
struction. Let us consider an example. Fig. 3.14 shows the internal archi- 
tecture of an ADSP 2100 DSP processor. 




Figure 3.14. Internal architecture of the ADSP 2100 processor 

The processor has two memories, called D and P. A special address gen- 
erating unit (AGU) can be used to provide the pointers for accessing these 
memories. There are separate units for additions and multiplications, each 
with their own argument registers AX, AY, AF, MX, MY and MF. The multi- 
plier is connected to a second adder in order to compute series of multipli- 
cations and additions quickly. 

For this processor, the update of the partial sum is essentially performed 
in a single cycle. For this purpose, the two memories are allocated to hold 
the two arrays x and a and address registers are allocated such that relevant 
pointers can be easily updated in the AGU. Partial sums $y_{i,j}$ are stored in 
MR. The pipelined computation involves registers A1, A2, MX and MY, as 
can be seen from the following implementation of the filter. 

MR:=0; A1:=1; A2:=n-2; MX:=x[n-1]; MY:=a[0]; 

for (j=1; j<=n; j++) 

{MR:=MR+MX*MY; MX:=x[A2]; MY:=a[A1]; 

A1++; A2--} 

A single instruction encodes the loop body, comprising the following oper- 
ations: 

- reading of two arguments from argument registers MX and MY, multi- 
plying them and adding the product to register MR storing values $y_{i,j}$, 

- fetching the next elements of arrays a and x from memories P and D 
and storing them in argument registers MX and MY, 

- updating pointers to the next arguments, stored in address registers A1 
and A2, 

- testing for the end of the loop. 

This way, each iteration requires just a single instruction. In order to achieve 
this, several operations are performed in parallel. For given computational 
requirements, this (limited) form of parallelism leads to relatively low clock 
frequencies. Furthermore, the registers in this architecture perform differ- 
ent functions. They are said to be heterogeneous. Heterogeneous register 
files are a common characteristic for DSP processors. In order to avoid 
extra cycles for testing for the end of the loop, zero-overhead loop in- 
structions are frequently provided in DSP processors. With such instruc- 
tions, a single or a small number of instructions can be executed a fixed 
number of times. Processors not optimized for DSP would probably need 
several instructions per iteration and would therefore require a higher clock 
frequency, if available. 
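
The pipelined loop shown above can be mirrored step by step in software in order to see how the registers cooperate. The sketch below keeps the register names MR, MX, MY, A1 and A2 from the text. It adds one assumption that is not part of the original: a prefetch beyond the end of an array, which happens in the last iteration, returns 0, since the fetched value is never used.

```python
def filter_pipelined(x, a):
    """Mirror the register-level loop: MR accumulates, MX/MY hold prefetched operands.

    Assumption: a prefetch past the end of an array (last iteration) yields 0;
    the fetched value is never used.
    """
    n = len(a)

    def load(array, index):
        return array[index] if 0 <= index < len(array) else 0

    MR, A1, A2 = 0, 1, n - 2
    MX, MY = x[n - 1], a[0]
    for _ in range(n):
        # The three steps below correspond to one single-cycle DSP instruction.
        MR = MR + MX * MY                    # multiply/accumulate
        MX, MY = load(x, A2), load(a, A1)    # fetch next operands from D and P
        A1, A2 = A1 + 1, A2 - 1              # pointer updates in the AGU
    return MR


print(filter_pipelined([1, 2, 3], [1, 0, 2]))
```

In Python the three steps run one after the other, whereas the DSP performs them in parallel within a single instruction.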

3.4.3.1 DSP-Processors 

In addition to allowing single instruction realizations of loop bodies for filter- 
ing, DSP processors provide a number of other application-domain oriented 
features: 

■ Specialized addressing modes: In the filter application described above, 
only the last n elements of x need to be available. Ring buffers can be 
used for that. These can be implemented easily with modulo addressing. In 
modulo addressing, addresses can be incremented and decremented until 
the first or last element of the buffer is reached. Additional increments or 
decrements will result in addresses pointing to the other end of the buffer. 

■ Separate address generation units: Address generation units (AGUs) are 
typically directly connected to the address input of the data memory (see 
fig. 3.15). 




[Block diagram; surviving label: modify register file M] 

Figure 3.15. AGU using special address registers 

Addresses which are available in address registers can be used in register- 
indirect addressing modes. This saves machine instructions, cycles and 
energy. In order to increase the usefulness of address registers, instruc- 
tion sets typically contain auto-increment and -decrement options for most 
instructions using address registers. 

■ Saturating arithmetic: Saturating arithmetic changes the way overflows 
and underflows are handled. In standard binary arithmetic, wrap-around is 
used for the values returned after an overflow or underflow. Fig. 3.16 shows 
an example in which two unsigned four-bit numbers are added. A carry is 
generated which cannot be returned in any of the standard registers. The 
result register will contain a pattern of all zeros. No result could be further 
away from the true result than this one. 

Figure 3.16. Wrap-around vs. saturating arithmetic for unsigned integers 



In saturating arithmetic, we try to return a result which is as close as possi- 
ble to the true result. For saturating arithmetic, the largest value is returned 
in the case of an overflow and the smallest value is returned in the case of 
an underflow. This approach makes sense especially for video and audio 
applications: the user will hardly recognize the difference between the true 
result value and the largest value that can be represented. Also, it would be 
useless to raise exceptions if overflows occur, since it is difficult to handle 
exceptions in real-time. Note that we need to know whether we are dealing 
with signed or unsigned add instructions in order to return the right value. 

■ Fixed-point arithmetic: Floating-point hardware increases the cost and 
power-consumption of processors. Consequently, it has been estimated 
that 80 % of the DSP processors do not include floating-point hardware 
[Aamodt and Chow, 2000]. However, in addition to supporting integers, 
many such processors do support fixed-point numbers. Fixed-point data 
types can be specified by a 3-tuple (wl, iwl, sign), where wl is the total 
word-length, iwl is the integer word-length (the number of bits left of the 
binary point), and sign ∈ {s, u} denotes whether we are dealing with un- 
signed or signed numbers. See also fig. 3.17. Furthermore, there may be 
different rounding modes (e.g. truncation) and overflow modes (e.g. satu- 
rating and wrap-around arithmetic). 

[Diagram; surviving labels: binary point, fwl, wl] 

Figure 3.17. Parameters of a fixed-point number system 

For fixed-point numbers, the position of the binary point is maintained after 
multiplication (some low order bits are truncated or rounded). For fixed- 
point processors, this operation is supported by hardware. 

■ Real-time capability: Some of the features of modern processors used 
in PCs are designed to improve the average execution time of programs. 
In many cases, it is difficult if not impossible to formally verify that they 
improve the worst case execution time. In such cases, it may be better not to 
implement these features. For example, it is difficult (though not impossible 
[Absint, 2002]) to guarantee a certain speedup resulting from the use of 
caches. Therefore, many embedded processors do not have caches. Also, 
virtual addressing and demand paging are normally not found in embedded 
systems. 

■ Multiple memory banks or memories: the usefulness of multiple mem- 
ory banks was demonstrated in the ADSP 2100 example: the two memories 
D and P allow fetching both arguments at the same time. Several DSP pro- 
cessors come with two memory banks. 

■ Heterogeneous register files: heterogeneous register files were already men- 
tioned for the filter application. 

■ Multiply/accumulate instructions: these instructions perform multipli- 
cations followed by additions. They were also already used in the filter 
application. 
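
Three of the features listed above (modulo addressing, saturating arithmetic and fixed-point multiplication) can be imitated in software to make their behavior concrete. The following sketch is not tied to any particular DSP. All function names are hypothetical, the four-bit operands are chosen such that their sum is 16, and fixed-point values are assumed to be stored as plain integers with `fwl` fractional bits.

```python
def modulo_increment(address, base, length, step=1):
    """Modulo addressing: step through a ring buffer of `length` words starting at `base`."""
    return base + (address - base + step) % length


def wraparound_add_unsigned(a, b, bits=4):
    """Standard binary add: the carry out of the top bit is lost."""
    return (a + b) & ((1 << bits) - 1)


def saturating_add_unsigned(a, b, bits=4):
    """Unsigned add that clamps to the largest representable value instead of wrapping."""
    return min(a + b, (1 << bits) - 1)


def fixed_mul(a, b, fwl):
    """Multiply two fixed-point values stored as integers with `fwl` fractional bits.

    The raw product has 2*fwl fractional bits; shifting right by fwl truncates
    the low-order bits and keeps the binary point where it was.
    """
    return (a * b) >> fwl


# Ring buffer of 4 words at address 0x100: stepping past the end wraps to the start.
print(hex(modulo_increment(0x103, base=0x100, length=4)))

# 0111 + 1001 overflows four bits: wrap-around yields 0, saturation yields 15.
print(wraparound_add_unsigned(0b0111, 0b1001), saturating_add_unsigned(0b0111, 0b1001))

# 1.5 * 2.25 with 4 fractional bits: 24 * 36 >> 4 = 54, i.e. 54/16 = 3.375.
print(fixed_mul(24, 36, fwl=4) / 16)
```

On a DSP, these operations are performed by the address generation unit and the arithmetic hardware and do not require extra instructions.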

3.4.3.2 Multimedia processors 

Registers and arithmetic units of many modern architectures are 64 bits wide. 
Therefore, two 32 bit data types (“double words”), four 16 bit data types 
(“words”) or eight 8 bit data types (“bytes”) can be packed into a single register 
(see fig. 3.18). 

Figure 3.18. Using 64 bit registers for packed words 

Arithmetic units can be designed such that they suppress carry bits at double 
word, word or byte boundaries. Multimedia instruction sets exploit this fact by 
supporting operations on packed data types. Such instructions are sometimes 
called single-instruction, multiple-data (SIMD) instructions, since a single in- 
struction encodes operations on several data elements. With bytes packed into 
64-bit registers, speed-ups of up to about eight over non-packed data types are 
possible. Data types are typically stored in packed form in memory. Unpacking 
and packing are avoided if arithmetic operations on packed data types are used. 
Furthermore, multimedia instructions can usually be combined with saturating 
arithmetic and therefore provide a more efficient form of overflow handling 
than standard instructions. Hence, the overall speed-up achieved with multi- 
media instructions can be significantly larger than the factor of eight enabled 
by operations on packed data types. 
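
The suppression of carries at byte boundaries can be imitated with ordinary integer operations. The sketch below adds eight bytes packed into one 64-bit word using wrap-around arithmetic within each byte; saturation is not modeled. The helper names `pack`, `unpack` and `packed_add_bytes` are hypothetical.

```python
MASK64 = (1 << 64) - 1
HIGH_BITS = 0x8080808080808080   # top bit of each of the eight byte lanes
LOW_BITS = 0x7F7F7F7F7F7F7F7F    # the remaining seven bits of each lane


def packed_add_bytes(a, b):
    """Add eight packed bytes lane by lane, suppressing carries across byte boundaries."""
    low_sum = (a & LOW_BITS) + (b & LOW_BITS)            # carries stay inside each lane
    return (low_sum ^ ((a ^ b) & HIGH_BITS)) & MASK64    # add top bits without carry-out


def pack(values):
    """Pack eight byte values (first element in the least significant lane)."""
    word = 0
    for lane, value in enumerate(values):
        word |= (value & 0xFF) << (8 * lane)
    return word


def unpack(word):
    """Split a 64-bit word into its eight byte lanes."""
    return [(word >> (8 * lane)) & 0xFF for lane in range(8)]


a = pack([1, 2, 3, 4, 250, 6, 7, 255])
b = pack([10, 20, 30, 40, 10, 60, 70, 1])
print(unpack(packed_add_bytes(a, b)))
```

A single call processes all eight lanes at once, which corresponds to the speed-up of up to about eight mentioned above.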

3.4.3.3 Very long instruction word (VLIW) processors 

Computational demands for embedded systems are increasing, especially when 
multimedia applications, advanced coding techniques or cryptography are in- 
volved. Performance improvement techniques used in high-performance mi- 
croprocessors are not appropriate for embedded systems: driven by the need 
for instruction set compatibility, processors found, for example, in PCs spend 
a huge amount of resources and energy on automatically finding parallelism in 
application programs. Still, their performance is frequently not sufficient. For 
embedded systems, we can exploit the fact that instruction set compatibility 
with PCs is not required. Therefore, we can use instructions which explicitly 
identify operations to be performed in parallel. This is possible with explicit 
parallelism instruction set computers (EPICs). With EPICs, detection of 
parallelism is moved from the processor to the compiler^. This avoids spend- 
ing silicon and energy on the detection of parallelism at runtime. As a special 
case, we consider very long instruction word (VLIW) processors. For VLIW 
processors, several operations or instructions are encoded in a long instruction 
word (sometimes called instruction packet) and are assumed to be executed 
in parallel. Each operation/instruction is encoded in a separate field of the in- 
struction packet. Each field controls certain hardware units. Four such fields 
are used in fig. 3.19, each one controlling one of the hardware units. 



[Block diagram; surviving label: instruction packet] 

Figure 3.19. VLIW architecture (example) 

For VLIW architectures, the compiler has to generate instruction packets. This 
requires the compiler to be aware of the available hardware units and to sched- 
ule their use. 



^EPICs are sometimes also used for PCs [Transmeta, 2005, Intel, 2005]. However, legacy problems result 
in severe constraints for doing this. 

Instruction fields must be present, regardless of whether or not the correspond- 
ing functional unit is actually used in a certain instruction cycle. As a result, 
the code density of VLIW architectures may be low, if insufficient parallelism 
is detected to keep all functional units busy. The problem can be avoided if 
more flexibility is added. For example, the Texas Instruments TMS 320C6xx 
family of processors implements a variable instruction packet size of up to 256 
bits. In each instruction field, one bit is reserved to indicate whether or not the 
operation encoded in the next field is still assumed to be executed in parallel 
(see fig. 3.20). No instruction bits are wasted for unused functional units. 

Figure 3.20. Instruction packets for TMS 320C6xx 

Due to their variable length instruction packets, TMS 320C6xx processors do 
not quite correspond to the classical model of VLIW processors. Due to their 
explicit description of parallelism, they are EPIC processors, though. 
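
The effect of the reserved bit can be sketched with a small grouping routine. The model below is deliberately abstract and does not reproduce the actual TMS 320C6xx encoding: each instruction field is represented as a hypothetical pair (operation, p_bit), where a p_bit of 1 is assumed to mean that the next field is executed in parallel with the current one.

```python
def split_into_execute_packets(fields):
    """Group (operation, p_bit) fields; p_bit == 1 means the NEXT field runs in parallel."""
    packets, current = [], []
    for operation, p_bit in fields:
        current.append(operation)
        if p_bit == 0:              # chain ends here: close the packet
            packets.append(current)
            current = []
    if current:                     # trailing chain left open by the last p_bit
        packets.append(current)
    return packets


# Hypothetical stream of instruction fields (operation names are made up).
stream = [("A", 1), ("B", 1), ("C", 0), ("D", 0), ("E", 1), ("F", 0)]
print(split_into_execute_packets(stream))
```

Operations that cannot be executed in parallel with others occupy one field only, and no padding is needed for idle functional units.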

Partitioned Register Files. Implementing register files for VLIW and 
EPIC processors is far from trivial. Due to the large number of operations 
that can be performed in parallel, a large number of register accesses has to 
be provided in parallel. Therefore, a large number of ports is required. How- 
ever, the delay, size and energy consumption of register files increase with 
their number of ports. Hence, register files with very large numbers of ports 
are inefficient. As a consequence, many VLIW/EPIC architectures use parti- 
tioned register files. Functional units are then only connected to a subset of 
the register files. As an example, fig. 3.21 shows the internal structure of the 
TMS 320C6xx processors. These processors have two register files and each 
of them is connected to half of the functional units. During each clock cycle, 
only a single path from one register file to the functional units connected to the 
other register file is available. 

Alternative partitionings are considered by Lapinskii et al. [Lapinskii et al., 
2001]. 

Figure 3.21. Partitioned register files for TMS 320C6xx 

Many DSP processors are actually VLIW processors. As an example, we are 
considering the M3-DSP processor [Fettweis et al., 1998]. The M3-DSP pro- 
cessor is a VLIW processor containing (up to) 16 parallel data paths. These 
data paths are connected to a group memory, providing the necessary argu- 
ments in parallel (see fig. 3.22). 




Figure 3.22. M3-DSP (simplified) 



Predicated Execution. A potential problem of VLIW and EPIC archi- 
tectures is their potentially large delay penalty: This delay penalty might orig- 
inate from branch instructions found in some instruction packets. Instruction 
packets normally have to pass through pipelines. Each stage of these pipelines 
implements only part of the operations to be performed by the instructions ex- 
ecuted. The fact that branch instructions exist cannot be detected in the first 
stage of the pipeline. When the execution of the branch instruction is finally 
completed, additional instructions have already entered the pipeline (see fig. 
3.23). 

Figure 3.23. Branch instruction and delay slots 

There are essentially two ways to deal with these additional instructions: 

1 They are executed as if no branch had been present. This case is called de- 
layed branch. Instruction packet slots that are still executed after a branch 
are called branch delay slots. These branch delay slots can be filled with 
instructions which would be executed before the branch if there were no 
delay slots. However, it is normally difficult to fill all delay slots with use- 
ful instructions and some have to be filled with no-operation instructions 
(NOPs). The term branch delay penalty denotes the loss of performance 
resulting from these NOPs. 

2 The pipeline is stalled until instructions from the branch target address have 
been fetched. There are no branch delay slots in this case. In this organiza- 
tion the branch delay penalty is caused by the stall. 

Branch delay penalties can be significant. For example, the TMS 320C6xx 
family of processors has up to 40 delay slots. Therefore, efficiency can be im- 
proved by avoiding branches, if possible. In order to avoid branches originating 
from if-statements, predicated instructions have been introduced. For each 
predicated instruction, there is a predicate. This predicate is encoded in a few 
bits and evaluated at run-time. If the result is true, the instruction is executed. 
Otherwise, it is effectively turned into a NOP. Predication can also be found in 
RISC machines such as the ARM processor. Example: ARM instructions, as 
introduced on page 104, include a four-bit field. These four bits encode vari- 
ous expressions involving the condition code registers. Values stored in these 
registers are checked at run-time. They determine whether or not a certain 
instruction has an effect. 

Predication can be used to implement small if-statements efficiently: the
condition is stored in one of the condition registers and if-statement bodies
are implemented as predicated instructions which depend on this condition.
This way, if-statement bodies can be evaluated in parallel with other
operations and no delay penalty is incurred.
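
The effect of predication can be illustrated with a small sketch. It models a hypothetical mini-machine (the instruction format, the flag names `ge` and `lt`, and the helper `run_predicated` are assumptions made for illustration and do not correspond to the ARM or TMS 320C6xx encodings). Both bodies of an if-statement are issued as straight-line code, and the predicate decides which of them has an effect.

```python
# Hypothetical mini-machine: names and encoding are illustrative assumptions,
# not the ARM or TMS 320C6xx instruction formats.

def run_predicated(program, regs, flags):
    """Execute a straight-line program of predicated instructions.

    Each instruction is (predicate, dest, operation). The predicate names a
    condition flag (or None for "always"); if the flag is false, the
    instruction behaves like a NOP.
    """
    for predicate, dest, operation in program:
        if predicate is None or flags[predicate]:
            regs[dest] = operation(regs)
    return regs


def abs_diff(a, b):
    """Compute |a - b| without any branch instruction in the 'program'."""
    regs = {"a": a, "b": b, "r": 0}
    # The comparison sets two complementary condition flags.
    flags = {"ge": a >= b, "lt": a < b}
    program = [
        ("ge", "r", lambda r: r["a"] - r["b"]),  # then-part, effective if a >= b
        ("lt", "r", lambda r: r["b"] - r["a"]),  # else-part, effective if a < b
    ]
    return run_predicated(program, regs, flags)["r"]
```

Both instructions pass through the machine for every input. No branch is taken, so no delay slots have to be filled.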

A large number of the processors in embedded systems are in fact
micro-controllers. Micro-controllers are typically not very complex and can be
used easily. Due to their relevance for designing control systems, we introduce
one of the most frequently used processors: the Intel 8051. This processor has
the following characteristics:

■ 8 bit CPU, optimized for control applications, 

■ large set of operations on Boolean data types, 

■ program address space of 64 k bytes, 

■ separate data address space of 64 k bytes, 

■ 4k bytes of program memory on chip, 128 bytes of data memory on chip, 

■ 32 I/O lines, each of which can be addressed individually, 

■ 2 counters on the chip, 

■ universal asynchronous receiver/transmitter for serial lines available on the 
chip, 

■ clock generation on the chip, 

■ many variations commercially available. 

All these characteristics are quite typical for micro-controllers. 

3.4.4 Reconfigurable Logic 

In many cases, full custom hardware chips (ASICs) are too expensive and 
software-based solutions are too slow or too energy consuming. Reconfig- 
urable logic provides a solution if algorithms can be efficiently implemented 
in custom hardware. It can be almost as fast as special purpose hardware. In 
contrast to special purpose hardware, the performed function can be changed 
by using configuration data. Due to these properties, reconfigurable logic finds 
applications in the following areas: 

■ Fast prototyping: modern ASICs can be very complex and the design
effort can be large and take a long time. It is therefore frequently desirable
to generate a prototype, which can be used for experimenting with a system
which behaves “almost” like the final system. The prototype can be more
costly and larger than the final system. Also, its power consumption can
be larger than the final system, some timing constraints can be relaxed, and
only the essential functions need to be available. Such a system can then be
used for checking the fundamental behavior of the future system.

■ Low volume applications: If the expected market volume is too small
to justify the development of special purpose ASICs, reconfigurable logic
can be the right hardware technology for applications that cannot be
implemented in software.

Reconfigurable hardware typically includes random access memory (RAM) 
to store configurations during normal operation of the hardware. Such RAM 
is normally volatile (the information is stored only while power is applied). 
Therefore, the configuration data must be copied into the configuration RAM at 
power-up. Persistent storage technology such as read-only memories (ROMs) 
and Flash memories will then provide the configuration data. 

Field programmable gate arrays (FPGAs) are the most common form of re- 
configurable hardware. As the name indicates, such devices are programmable 
“in the field” (after fabrication). Furthermore, they consist of arrays of pro- 
cessing elements. As an example, fig. 3.24 shows the array structure of Xilinx 
Virtex-II arrays (see http://www.xilinx.com). 



Figure 3.24. Floor-plan of Virtex II FPGAs (labeled blocks include the digital
clock manager and the I/O blocks)

Currently (in 2003), Virtex II arrays contain up to 112 x 104 configurable 
logic blocks (CLBs). These can be connected using a programmable inter- 
connect structure. Arrays also contain up to 1108 input/output connections 
and special clock processing blocks. In addition, there are up to 168 18 x 18 
bit multipliers and 3024 kbits of RAM (Block RAM). Each CLB consists of 4 
so-called slices (see fig. 3.25). 

Figure 3.25. Virtex II CLB 



Each slice contains two 16 bit memories F and G. These memories can be
used as look-up tables (LUT) for implementing all 2^16 Boolean functions of
4 variables. With the help of multiplexers (MUXF5, MUXFx), several of these
memories can also be combined such that table look-ups for up to 8 variables
are possible. They can also serve as ordinary RAM or as shift registers (SRLs).
Each slice also includes two output registers and some special logic (ORCY,
CY, etc.) for additions (see fig. 3.26).
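
The look-up table principle can be sketched in a few lines. The following is a minimal model, assuming that the four inputs form the address x3 x2 x1 x0 of a 16 bit memory; the function names `make_lut` and `lut_read` are hypothetical and are not part of any Xilinx tool.

```python
def make_lut(func):
    """Build the 16 bit memory content for a Boolean function of 4 variables.

    The bit at address (x3 x2 x1 x0, read as a binary number) holds
    func(x3, x2, x1, x0).
    """
    content = 0
    for address in range(16):
        bits = [(address >> k) & 1 for k in (3, 2, 1, 0)]
        if func(*bits):
            content |= 1 << address
    return content


def lut_read(content, x3, x2, x1, x0):
    """Evaluate the function by a table look-up, as a hardware LUT would."""
    address = (x3 << 3) | (x2 << 2) | (x1 << 1) | x0
    return (content >> address) & 1


# Example: 4-input parity (XOR of all inputs).
parity = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d)

# Every 16 bit memory content encodes exactly one function of 4 variables.
NUMBER_OF_FUNCTIONS = 2 ** 16
```

Changing the implemented function only requires loading a different 16 bit value, which is what the configuration data does for each LUT.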




Figure 3.26. Virtex II Slice (simplified) 

Configuration data determines the setting of multiplexers in the slices, the 
clocking of registers and RAM, the content of RAM components and the con- 
nections between CLBs. Typically, this configuration data is generated from 
a high-level description of the functionality of the hardware, for example in 
VHDL. Ideally, the same description could also be used for generating ASICs 
automatically. In practice, some interaction is required. 

Integration of reconfigurable computing with processors and software is sim- 
plified with the Virtex II Pro series of FPGAs from Xilinx. These FPGAs 
contain up to 4 Power-PC processors and faster I/O blocks. 

3.5 Memories 

Data, programs and FPGA configurations must be stored in some kind of mem- 
ory. This must be done in an efficient way. Efficient means run-time, code-size 
and energy-efficient. Code-size efficiency requires a good compiler and can 
be improved with code compression (see page 103). Memory hierarchies can 
be exploited in order to achieve a good run-time and energy efficiency. The 
underlying reason is that large memories require more energy per access and 
are also slower than small memories. 

Fig. 3.27 shows the cycle time and the power as a function of the size of the 
memory [Rixner et al., 2000]. The same behavior can be observed for larger 
memories. 




Figure 3.27. Cycle time and power as a function of the memory size

The difference in speed between processors and memories is expected to
increase (see fig. 3.28).

While the speed of memories is increasing by only a factor of about 1.07 per 
year, processor speeds are so far increasing by a factor of 1.5 to 2 per year 
[Machanik, 2002]. This means that the gap between processor speeds and 
memory speeds is becoming larger. 
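
The compounding effect of these two growth rates can be made explicit with a short calculation. This is a sketch that assumes constant yearly growth factors (1.07 for memories and the lower bound of 1.5 for processors, as quoted above); the time spans of 1, 5 and 10 years are chosen arbitrarily for illustration.

```python
def speed_gap(years, processor_factor=1.5, memory_factor=1.07):
    """Processor/memory speed ratio after the given number of years.

    The ratio is normalized to 1.0 in year 0 and assumes that both yearly
    growth factors stay constant.
    """
    return (processor_factor / memory_factor) ** years


for years in (1, 5, 10):
    print(years, round(speed_gap(years), 1))
```

Even with the lower processor growth factor, the ratio grows by about 40 % per year.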

Therefore, it is important to use smaller and faster memories that act as buffers 
between the main memory and the processor. In contrast to PC-like systems, 
the architecture of these small memories must guarantee a predictable real- 
time performance. A combination of small memories containing frequently 
used data and instructions and a larger memory containing the remaining data 
and instructions is generally also more energy efficient than a single, large 
memory. 

Caches were initially introduced in order to provide good run-time efficiency.
In the context of fig. 3.27 (right), however, it is obvious that caches
potentially also improve the energy-efficiency of a memory system. Accesses to
caches are accesses to small memories and therefore may require less energy
per access than large memories. However, for caches it is required that the
hardware checks whether or not the cache has a valid copy of the information
associated with a certain address. This checking involves checking the
tag fields of caches, containing a subset of the relevant address bits [Hennessy
and Patterson, 1996]. Reading these tags requires additional energy. Also, the
predictability of the real-time performance of caches is frequently low.

Alternatively, small memories can be mapped into the address space (see fig.
3.29).

Figure 3.29. Memory map with scratch-pad included



Such memories are called scratch pad memories (SPMs). Frequently used
variables and instructions should be allocated to that address space and no
checking needs to be done in hardware. As a result, the energy per access is
reduced. Fig. 3.30 shows a comparison between the energy required per access 
to the scratch-pad (SPM) and the energy required per access to the cache. 

Figure 3.30. Energy consumption per scratch pad and cache access 



For a two-way set associative cache, the two values differ by a factor of about 
three. The values in this example were computed using the energy consumption
for RAM arrays as estimated by the CACTI cache estimation tool [Wilton
and Jouppi, 1996].

SPMs can improve the memory access times very predictably, if the compiler 
is in charge of keeping frequently used variables in the SPM (see page 180). 



3.6 Output 

Output devices of embedded systems include 



■ displays: Display technology is an area which is extremely important.
Accordingly, a large amount of information [Society for Display Technology,
2003] exists on this technology. Major research and development efforts
lead to new display technology such as organic displays [Gelsen, 2003].
Organic displays emit light and can be fabricated with very high
densities. In contrast to LCD displays, they do not need back-light and
polarizing filters. Major changes are therefore expected in these markets.

■ electro-mechanical devices: these influence the environment through mo- 
tors and other electro-mechanical equipment. 



Analog as well as digital output devices are used. In the case of analog output 
devices, the digital information must first be converted by digital-to-analog 
(D/A)-converters. 

3.6.1 D/A-converters 

D/A-converters are not very complex. Fig. 3.31 shows the schematic of a sim- 
ple D/A converter. 




Figure 3.31. D/A-converter 

The operational amplifier shown in fig. 3.31 amplifies the voltage difference
between the two inputs by a very large factor (some powers of ten). Due to
resistor R_1, resulting output voltages are fed back to input -. Whenever a small
voltage between the two inputs exists, it will be inverted, amplified and fed
back to the inputs, reducing the input voltage. Due to the large amplification,
the differential voltage between the two inputs is reduced to virtually zero.
Since input + is connected to ground, the voltage between input - and ground
is also virtually zero. “Virtually” means: zero except for some very small
voltage resulting from a potentially imperfect operational amplifier, that is,
practically zero.

The key idea is to first generate a current which is proportional to the value
represented by a bit-vector x and to convert this current into an equivalent
voltage.

According to Kirchhoff’s laws, current I is the sum of the currents through the
resistors. The current through any resistor is zero, if the corresponding element
of bit-vector x is ’0’. If it is ’1’, the current corresponds to the weight of that bit,
since resistor values are chosen accordingly.

I = x_3 \frac{V_{ref}}{R} + x_2 \frac{V_{ref}}{2R} + x_1 \frac{V_{ref}}{4R} + x_0 \frac{V_{ref}}{8R} = \frac{V_{ref}}{R} \sum_{i=0}^{3} x_i 2^{i-3}    (3.7)



Also, according to Kirchhoff’s laws and due to the virtual zero at input -, we
have: V + R_1 * I' = 0.

The currents into the inputs of the operational amplifier are practically zero,
and the two currents I and I' are equal: I = I'. Hence:



V + R_1 * I = 0    (3.8)

From equations 3.7 and 3.8 we obtain: 

-V = V_{ref} \frac{R_1}{R} \sum_{i=0}^{3} x_i 2^{i-3} = V_{ref} \frac{R_1}{8R} nat(x)    (3.9)

nat denotes the natural number represented by bit-vector x. Obviously, the 
output voltage is proportional to the value represented by x. Positive output 
voltages and bit-vectors representing two’s complement numbers require mi- 
nor extensions. 
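
The proportionality between -V and nat(x) can be checked numerically. The sketch below assumes a 4 bit converter with binary-weighted resistors R, 2R, 4R and 8R, a reference voltage V_ref and a feedback resistor R_1. These component values are assumptions about the circuit of fig. 3.31, which is not reproduced here, and the default parameter values are arbitrary.

```python
def dac_output(x, v_ref=1.0, r=1000.0, r1=1000.0):
    """Output voltage V of an (assumed) 4 bit binary-weighted D/A converter.

    x is the bit-vector (x3, x2, x1, x0), x3 being the most significant bit.
    """
    # Current through each branch: zero for a '0' bit, V_ref / (weighted R)
    # for a '1' bit. Current I is the sum of the branch currents.
    weights = (r, 2 * r, 4 * r, 8 * r)
    current = sum(bit * v_ref / w for bit, w in zip(x, weights))
    # Virtual zero at input -: V + R_1 * I = 0.
    return -r1 * current


def nat(x):
    """Natural number represented by the bit-vector x."""
    value = 0
    for bit in x:
        value = 2 * value + bit
    return value
```

With the default values, `dac_output((1, 0, 0, 0))` yields -1.0 V and `dac_output((0, 0, 0, 1))` yields -0.125 V, that is, -V grows linearly with nat(x).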

An interesting question is this one: suppose that the processors used in the 
hardware loop forward values from A/D-converters unchanged to the D/A- 
converters. Would it be possible to reconstruct the original analog voltage 
from the sensor outputs at the outputs of the D/A-converters? According to 
Nyquist’s sampling theorem (see Oppenheim et al. [Oppenheim et al., 1999]),
it is indeed possible, provided that the clock frequency of the sample-and-hold
circuit is at least twice as large as the largest frequency found in the input
voltage. This does, however, apply only if we have an infinite precision of the
digital values. The limited precision of digital values effectively adds some
noise to the digital signals [Oppenheim et al., 1999], which cannot be
completely removed.

3.6.2 Actuators 

There is a huge variety of actuators [Elsevier B.V., 2003a]. Actuators range
from huge ones that are able to move tons of weight to tiny ones with
dimensions in the μm range, like the one shown in fig. 3.32.

It is impossible to provide an overview. As an example, we mention only 
a special kind of actuators which will become more important in the future: 
microsystem technology enables the fabrication of tiny actuators, which can 
be put into the human body, for example. 

Using such tiny actuators, the amount of drugs fed into the body can be adapted 
to the actual need. This allows a much better medication than needle-based 
injections. Fig. 3.32 shows a tiny motor manufactured with microsystem
technology. The dimensions are in the μm range. The rotating center is
controlled by electrostatic forces.

Figure 3.32. Microsystem technology based actuator motor (partial view; courtesy
E. Obermeier, MAT, TU Berlin), (c)TU Berlin





Chapter 4 



STANDARD SOFTWARE: 

EMBEDDED OPERATING SYSTEMS, 
MIDDLEWARE, AND SCHEDULING 



Not all components of embedded systems need to be designed from scratch. 
Instead, there are standard components that can be reused. These components 
comprise knowledge from earlier design efforts and constitute intellectual 
property (IP). IP reuse is one key technique in coping with the increasing com- 
plexity of designs. Re-using available software components is at the heart of 
the platform-based design methodology, which will be briefly presented start- 
ing at page 151. 

Standard software components that can be reused include: embedded operat- 
ing systems (OS), real-time databases, and other forms of middleware. The last 
term denotes software providing an intermediate layer between the OS and ap- 
plication software (including, for example, libraries for communication). Calls 
to standard software components may already have to be included in the spec- 
ification. Therefore, information about the application programming interface 
(API) of these standard components may already be needed for completing 
executable specifications. 

Also, there are some standard approaches for scheduling which must be taken 
into account and which the designer must be aware of. Particular scheduling 
approaches may or may not be supported by a certain operating system. This 
constraint must also be taken into account. 

Consistent with the design information flow, we will be describing embedded 
operating systems, middleware and scheduling in this chapter (see also fig. 
4.1). 

Figure 4.1. Simplified design information flow 



4.1 Prediction of execution times 

Scheduling of tasks requires some knowledge about the duration of task
executions, especially if meeting time constraints has to be guaranteed, as is
the case in real-time (RT) systems. The worst-case execution time is the basis
for most scheduling algorithms.

Def.: The worst-case execution time (WCET) is an upper bound on the exe- 
cution times of tasks. 

Computing such a bound is undecidable in the general case. This is obvious 
from the fact that it is undecidable whether or not a program terminates. Hence, 
the WCET can only be computed for certain programs/tasks. For example, for 
programs without recursion and while loops and with constant iteration counts, 
the WCET can be computed. 
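
For such restricted programs, a bound can be obtained by a simple traversal of the program structure. The sketch below is an illustration under strong assumptions: every basic block has a known, fixed cycle count, and caches, pipelines and interrupts are ignored. The tuple-based program representation and the function name `wcet` are hypothetical.

```python
def wcet(node):
    """Upper bound on the execution time of a structured, loop-bounded program.

    node is one of (all forms are hypothetical, for illustration only):
      ("block", cycles)                      basic block with known cycle count
      ("seq", [child, ...])                  children executed in sequence
      ("if", cond, then_child, else_child)   two-way branch
      ("loop", iterations, body)             loop with constant iteration count
    """
    kind = node[0]
    if kind == "block":
        return node[1]
    if kind == "seq":
        return sum(wcet(child) for child in node[1])
    if kind == "if":
        _, cond, then_child, else_child = node
        # The condition is always evaluated; the longer branch is assumed.
        return wcet(cond) + max(wcet(then_child), wcet(else_child))
    if kind == "loop":
        _, iterations, body = node
        # Loop control overhead is assumed to be part of the body cost.
        return iterations * wcet(body)
    raise ValueError(f"unknown node kind: {kind}")


program = ("seq", [
    ("block", 5),
    ("loop", 10, ("if", ("block", 1), ("block", 4), ("block", 7))),
    ("block", 2),
])
```

For this example, `wcet(program)` evaluates to 5 + 10 * (1 + 7) + 2 = 87 cycles. The bound is safe for the assumed cost model, but it may be pessimistic if the longer branch is rarely taken.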

Computing tight upper bounds on the execution time may still be difficult.
Modern processor architectures’ pipelines with their different kinds of hazards
and memory hierarchies with limited predictability of hit rates are a source of
serious overestimations of the WCET. Sometimes, architectural features which
reduce the average execution time but cannot guarantee to reduce the WCET
are completely omitted from the design (see page 110). Computing the WCET
for systems containing caches and pipelines is a research topic (see, for
example, Healy et al. [Healy et al., 1999] and the web pages of AbsInt [Absint,
2002]). Interrupts and virtual memory (if present) result in more
complications. Accordingly, it is already difficult to compute the WCET for
assembly language programs. Computing tight bounds from a program written in a
high-level language such as C without any knowledge of the generated assembly
code is impossible.

The WCET may be required for different target technologies. If tasks are 
mapped to hardware, the WCET of that hardware needs to be computed. This 
may, in turn, require the synthesis of this hardware. Another approach is to 
respect timing constraints in hardware synthesis. 

For some of the design phases, we might also need information on estimated 
average case execution times, in addition to WCETs. Two different approaches 
that have been proposed are the following: 



■ Estimated cost and performance values: Quite a number of estimators
have been proposed for this purpose. Examples include the work by Jha
and Dutt [Jha and Dutt, 1993] for hardware, and Jain et al. [Jain et al.,
2001] for software. Generating sufficiently precise estimates requires some
effort.



■ Accurate cost and performance values: This is only possible if interfaces 
to “software synthesis tools” (compilers) and hardware synthesis tools ex- 
ist. This method can be more precise than the previous one, but may be 
significantly (and sometimes prohibitively) time consuming. 



In order to find good estimates, communication must also be considered.
Unfortunately, it is very hard to predict the communication cost.

4.2 Scheduling in real-time systems 

As indicated above, scheduling is one of the key issues in implementing RT- 
systems. Scheduling algorithms may be required a number of times during 
the design of RT-systems. Very rough calculations may already be required 
while fixing the specification. During hardware/software partitioning, some- 
what more detailed predictions of execution times may be required. After 
compilation, even more detailed knowledge exists about the execution times 
and accordingly, more precise schedules can be made. Finally, it may be
necessary to decide at run-time which task is to be executed next. Scheduling is
somewhat linked to performance evaluation, mentioned at the bottom of figure
4.1. Like performance evaluation, it cannot be constrained to a single design
step. We include scheduling in this chapter since it is closely linked to the
RTOS, but the reader has to keep in mind that some of the scheduling
techniques are independent of the RTOS. In the case of design-time scheduling,
RTOS scheduling may be limited to simple table look-ups for tasks to be
executed.

4.2.1 Classification of scheduling algorithms 

Scheduling algorithms can be classified according to various criteria. Fig. 4.2 
shows a possible classification of algorithms (similar schemes are described in 
books on the topic [Balarin et al., 1998], [Kwok and Ahmad, 1999], [Stankovic
et al., 1998], [Liu, 2000], [Buttazzo, 2002]).



Figure 4.2. Classes of scheduling algorithms (tree rooted at "real-time
scheduling"; lower levels distinguish preemptive / non-preemptive and
static / dynamic)



■ Soft and hard deadlines: Scheduling for soft deadlines is frequently based 
on extensions to standard operating systems. For example, providing task 
and operating system call priorities may be sufficient for systems with soft 
deadlines. We will not discuss these systems further in this book. More
work and a detailed analysis are required for hard deadline systems. For
these, we can use dynamic and static schedulers.

■ Scheduling for periodic and aperiodic tasks 

In the following, we will distinguish between periodic, aperiodic and spo- 
radic tasks. 

Definition: Tasks which must be executed once every p units of time are 
called periodic tasks, and p is called their period. Each execution of a 
periodic task is called a job. 

Definition: Tasks which are not periodic are called aperiodic. 

Definition: Aperiodic tasks requesting the processor at unpredictable times 
are called sporadic, if there is a minimum separation between the times at 
which they request the processor. 

■ Preemptive and non-preemptive scheduling: Non-preemptive schedulers 
are based on the assumption that tasks are executed until they are done. As 
a result the response time for external events may be quite long if some 
tasks have a large execution time. Preemptive schedulers have to be used 
if some tasks have long execution times or if the response time for external 
events is required to be short. 

■ Static and dynamic scheduling: Dynamic schedulers take decisions at 
run-time. They are quite flexible, but generate overhead at run-time. 

Also, they are usually not aware of global contexts such as resource re- 
quirements or dependences between tasks. For embedded systems, such 
global contexts are typically available at design time and they should be 
exploited. Static schedulers take their decisions at design time. They are 
based on planning the start times of tasks and generate tables of start times 
forwarded to a simple dispatcher. The dispatcher does not take any de- 
cisions, but is just in charge of starting tasks at the times indicated in the 
table. The dispatcher can be controlled by a timer, causing the dispatcher to 
analyze the table. Systems which are totally controlled by a timer are said 
to be entirely time triggered (TT systems). Such systems are explained in 
detail in the book by Kopetz [Kopetz, 1997]: 

In an entirely time-triggered system, the temporal control structure of all 
tasks is established a priori by off-line support-tools. This temporal con- 
trol structure is encoded in a Task-Descriptor List (TDL) that contains 
the cyclic schedule for all activities of the node (Figure 4.3). This sched- 
ule considers the required precedence and mutual exclusion relationships 
among the tasks such that an explicit coordination of the tasks by the op- 
erating system at run time is not necessary. Figure 4.3 includes scheduled 
task start, task stop and send message (send) activities. 



Time   Action     WCET
10     start T1   12
17     send M5
22     stop T1
38     start T2   20
47     send M3

Figure 4.3. Task descriptor list in a TT operating system

The dispatcher is activated by the synchronized clock tick. It looks at the
TDL, and then performs the action that has been planned for this instant.

The main advantage of static scheduling is that it can be easily checked if 
timing constraints are met: 

For satisfying timing constraints in hard real-time systems, predictability of 
the system behavior is the most important concern; pre-run-time scheduling 
is often the only practical means of providing predictability in a complex 
system (according to Xu and Parnas, as cited by Kopetz). 

The main disadvantage is that the response to sporadic events may be quite 
poor. 

■ Centralized and distributed scheduling: Multiprocessor scheduling al- 
gorithms can either be executed locally on one processor or can be dis- 
tributed among a set of processors. 

■ Type and complexity of schedulability test: 

In practice, it is very important to know whether or not a schedule exists 
for a given set of tasks and constraints. 

A set of tasks is said to be schedulable under a given set of constraints, if a 
schedule exists for that set of tasks and constraints. For many applications, 
schedulability tests are important. Tests which never give wrong results 
(called exact tests) are NP-hard in many situations [Garey and Johnson, 
1979]. Therefore, sufficient and necessary tests are used instead. For suf- 
ficient tests, sufficient conditions for guaranteeing a schedule are checked. 
There is a (hopefully small) probability of indicating that no schedule exists 
even if there exists one. Necessary tests are based on checking necessary 
conditions. They can be used to show that no schedule exists. However, 
there may be cases in which no schedule exists and we may still be unable 
to prove this. 

■ Mono- and multi-processor scheduling: Simple scheduling algorithms 
handle the case of single processors, whereas more complex algorithms 
also handle systems comprising multiple processors. For the latter, we 
can distinguish between algorithms for homogeneous multi-processor systems
and algorithms for heterogeneous multi-processor systems. The latter
are able to handle target-specific execution times and can also be applied
to mixed hardware/software systems, in which some tasks are mapped to
hardware.

■ Online- and offline scheduling: Online scheduling algorithms schedule
tasks at run-time, based on the information about the tasks arrived so far. In
contrast, offline algorithms schedule tasks, taking a priori knowledge about
arrival times, execution times, and deadlines into account.

■ Cost function: Different algorithms aim at minimizing different functions. 

Def.: Maximum lateness is defined as the difference between the comple- 
tion time and the deadline, maximized over all tasks. Maximum lateness is 
negative if all tasks complete before their deadline. 

■ Independent and dependent tasks: 

It is possible to distinguish between tasks without any inter-task commu- 
nication (in the following called simple, or S-tasks) and other tasks, called 
complex tasks. S-tasks can be in one out of two states: ready or running. 

The API of a TT-OS supporting S-tasks is quite simple: The application 
program interface (API) of an S-task in a TT system consists of three data 
structures and two operating system calls. ... The system calls are TERMI- 
NATE TASK and ERROR. The TERMINATE TASK system call is executed 
whenever the task has reached its termination point. In case of an error 
that cannot be handled within the application task, the task terminates its 
operation with the ERROR system call [Kopetz, 1997]. 
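
The table-driven dispatching described above for entirely time-triggered systems can be sketched as follows. The task descriptor list repeats the entries of figure 4.3; the function names and the cycle length of 50 time units are assumptions made for this illustration, and the actions are only logged rather than executed.

```python
# Task descriptor list following figure 4.3: (time, action, object).
TDL = [
    (10, "start", "T1"),
    (17, "send", "M5"),
    (22, "stop", "T1"),
    (38, "start", "T2"),
    (47, "send", "M3"),
]


def dispatch(tdl, tick):
    """Return the actions planned for this clock tick.

    The dispatcher takes no scheduling decisions; it only looks up the table.
    """
    return [(action, obj) for time, action, obj in tdl if time == tick]


def run_cycle(tdl, cycle_length):
    """Simulate one cycle of the schedule; a timer calls dispatch every tick."""
    log = []
    for tick in range(cycle_length):
        for action, obj in dispatch(tdl, tick):
            log.append((tick, action, obj))
    return log
```

Calling `run_cycle(TDL, 50)` reproduces the table entries in order. All scheduling effort has been spent at design time, which is why timing constraints can be checked before the system runs.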

4.2.2 Aperiodic scheduling 

4.2.2.1 Scheduling with no precedence constraints 

Let {T_i} be a set of tasks. Let (see fig. 4.4):

■ c_i be the execution time of T_i,

■ d_i be the deadline interval, that is, the time between T_i becoming available
and the time until which T_i has to finish execution,

■ l_i be the laxity or slack, defined as

l_i = d_i - c_i    (4.1)



Figure 4.4. Definition of the laxity of a task

If l_i = 0, then T_i has to be started immediately after it becomes executable.

Let us first consider* the case of uni-processor systems for which all tasks
arrive at the same time. If all tasks arrive at the same time, preemption is
obviously useless.



*We are using some of the material from the book by Buttazzo [Buttazzo, 2002] for this section. Refer to 
this book for additional references. 

A very simple scheduling algorithm for this case was found by Jackson in 
1955 [Jackson, 1955]. The algorithm is based on Jackson’s rule: Given a 
set of n independent tasks, any algorithm that executes the tasks in order of 
nondecreasing deadlines is optimal with respect to minimizing the maximum 
lateness. The algorithm is called Earliest Due Date (EDD). EDD requires all
tasks to be sorted by their deadlines. Hence, its complexity is O(n log(n)).
If the deadlines are known in advance, EDD can be implemented as a static
scheduling algorithm.
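
A minimal sketch of EDD is given below. It assumes that all tasks arrive at time 0 and are described by an execution time and a deadline; the function name `edd` and the task set are hypothetical and serve only as an illustration.

```python
def edd(tasks):
    """Earliest Due Date for tasks that all arrive at time 0.

    tasks maps a task name to (execution_time, deadline).
    Returns the execution order and the maximum lateness of that order.
    """
    # Sorting by deadline dominates the run time: O(n log(n)).
    order = sorted(tasks, key=lambda name: tasks[name][1])
    time = 0
    max_lateness = float("-inf")
    for name in order:
        execution_time, deadline = tasks[name]
        time += execution_time  # completion time of this task
        max_lateness = max(max_lateness, time - deadline)
    return order, max_lateness


# Hypothetical task set: name -> (execution time, deadline).
example = {"T1": (1, 3), "T2": (1, 10), "T3": (1, 7), "T4": (3, 8), "T5": (2, 5)}
```

For this task set, `edd(example)` returns the order T1, T5, T3, T4, T2 with a maximum lateness of -1, so all tasks complete before their deadlines.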

Let us consider the case of different arrival times for uni-processor systems 
next. Under this scenario, preemption can potentially reduce maximum late- 
ness. 

The Earliest Deadline First (EDF) algorithm is optimal with respect to
minimizing the maximum lateness. It is based on the following theorem [Horn,
1974]: Given a set of n independent tasks with arbitrary arrival times, any
algorithm that at any instant executes the task with the earliest absolute
deadline among all the ready tasks is optimal with respect to minimizing the
maximum lateness. See Buttazzo [Buttazzo, 2002] for the proof of this property.
EDF requires that, each time a new ready task arrives, it is inserted into a
queue of ready tasks, sorted by their deadlines. Hence, EDF is a dynamic
scheduling algorithm. If a newly arrived task is inserted at the head of the
queue, the currently executing task is preempted. If sorted lists are used for
the queue, the complexity of EDF is O(n^2). Bucket arrays could be used for
reducing the execution time.

Fig. 4.5 shows a schedule derived with the EDF algorithm. Vertical bars
indicate the arrival of tasks.



Figure 4.5. EDF schedule (time axis t from 0 to 22, with task arrivals marked)

At time 4, task T2 has an earlier deadline. Therefore it preempts T1. At time
5, task T3 arrives. Due to its later deadline it does not preempt T2.
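
The behavior just described can be reproduced with a small simulation. The sketch below advances time in unit steps and always runs the ready task with the earliest absolute deadline. The arrival times 4 and 5 follow the text; the execution times and deadlines are assumptions chosen so that the described preemptions occur, since the figure itself is not reproduced here.

```python
def edf_schedule(tasks):
    """Preemptive EDF on one processor, simulated in unit time steps.

    tasks maps a name to (arrival, execution_time, absolute_deadline).
    Returns a list of [start, end, name] execution segments.
    """
    remaining = {name: t[1] for name, t in tasks.items()}
    segments = []
    time = 0
    while any(remaining.values()):
        ready = [n for n in tasks if tasks[n][0] <= time and remaining[n] > 0]
        if not ready:
            time += 1  # processor idle until the next arrival
            continue
        current = min(ready, key=lambda n: tasks[n][2])  # earliest deadline
        if segments and segments[-1][2] == current and segments[-1][1] == time:
            segments[-1][1] = time + 1  # extend the running segment
        else:
            segments.append([time, time + 1, current])
        remaining[current] -= 1
        time += 1
    return segments


# Assumed parameters: name -> (arrival, execution time, absolute deadline).
tasks = {"T1": (0, 10, 33), "T2": (4, 3, 28), "T3": (5, 10, 29)}
```

With these values, T1 runs from 0 to 4, T2 from 4 to 7, T3 from 7 to 17, and T1 resumes from 17 to 23. A real implementation would react to arrival events through a sorted ready queue instead of polling every time unit.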

Least Laxity (LL), Least Slack Time First (LST), and Minimum Laxity First 
(MLF) are three names for another scheduling strategy [Liu, 2000]. According 
to LL scheduling, task priorities are a monotonically decreasing function of 
the laxity (see equation 4.1; the less laxity, the higher the priority). The laxity 
is dynamically changing. LL scheduling is also preemptive. Fig. 4.6 shows an 
example of an LL schedule, together with the computations of the laxity. 



Figure 4.6. Least laxity schedule (rows: T1, T2, T3; annotated laxities:
l(T1) = 33-15-6 = 12, l(T3) = 29-15-2 = 12)

At time 4, task T1 is preempted, as before. At time 5, T2 is now also preempted, 
due to the lower laxity of task T3. 
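
A corresponding sketch for LL scheduling recomputes the laxity of every ready task in each time unit, using laxity = deadline - current time - remaining execution time. The deadlines 33 and 29 and the arrival times 4 and 5 are taken from the text and the figure annotations; the remaining task parameters and the tie-breaking rule (prefer the running task, then the first task in the list) are assumptions made for this illustration.

```python
def ll_schedule(tasks):
    """Preemptive least-laxity scheduling, simulated in unit time steps.

    tasks maps a name to (arrival, execution_time, absolute_deadline).
    Returns a list of [start, end, name] execution segments.
    """
    remaining = {name: t[1] for name, t in tasks.items()}
    segments = []
    running = None
    time = 0
    while any(remaining.values()):
        ready = [n for n in tasks if tasks[n][0] <= time and remaining[n] > 0]
        if not ready:
            running = None
            time += 1  # processor idle until the next arrival
            continue

        def key(n):
            laxity = tasks[n][2] - time - remaining[n]
            # Assumed tie-break: keep the running task to avoid needless switches.
            return (laxity, 0 if n == running else 1)

        running = min(ready, key=key)
        if segments and segments[-1][2] == running and segments[-1][1] == time:
            segments[-1][1] = time + 1  # extend the running segment
        else:
            segments.append([time, time + 1, running])
        remaining[running] -= 1
        time += 1
    return segments


# Assumed parameters: name -> (arrival, execution time, absolute deadline).
tasks = {"T1": (0, 10, 33), "T2": (4, 3, 28), "T3": (5, 10, 29)}
```

With these values, T2 preempts T1 at time 4 and is itself preempted by T3 at time 5. At time 15, T1 and T3 both have a laxity of 12, which agrees with the annotations of fig. 4.6. The schedule contains more context switches than the EDF schedule for the same task set.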

It can be shown (this is left as an exercise in [Liu, 2000]) that LL is also an
optimal scheduling policy for mono-processor systems in the sense that it will
find a schedule if one exists. Due to its dynamic priorities, it cannot be used
with a standard OS providing only fixed priorities. LL scheduling requires
periodic checks of the laxity and, in contrast to EDF scheduling, knowledge
of the execution time, which it takes into account.

If preemption is not allowed, optimal schedules may have to leave the
processor idle at certain times in order to finish tasks with early deadlines
arriving late.

Proof: Let us assume that an optimal non-preemptive scheduler (not having
knowledge about the future) never leaves the processor idle. This scheduler
will then have to schedule the example of fig. 4.7 optimally (it will have to find
a schedule if one exists).

For the example of fig. 4.7 we assume we are given two tasks. Let T1 be a
periodic process with an execution time of 2, a period of 4 and a deadline
interval of 4. Let T2 be a task occasionally becoming available at times 4*n+1
and having an execution time and a deadline interval of 1. Let us assume that
the concurrent execution of T1 and T2 is not possible due to some resource
conflict. Under the above assumptions our scheduler has to start the execution
of task T1 at time 0, since it is supposed not to leave any idle time. Since the
scheduler is non-preemptive, it cannot start T2 when it becomes available at
time 1. Hence, T2 misses its deadline. If the scheduler had left the processor
idle (as shown in fig. 4.7 at time 4), a legal schedule would have been found.
Hence, the scheduler is not optimal. This contradicts the assumption that an
optimal scheduler which never leaves the processor idle exists, q.e.d.

Figure 4.7. Scheduler needs to leave processor idle

We conclude: In order to avoid missed deadlines the scheduler needs knowl- 
edge about the future. If no knowledge about the arrival times is available a 
priori, then no online algorithm can decide whether or not to keep the pro- 
cessor idle. It has been shown that EDF is still optimal among all scheduling 
algorithms not keeping the processor idle at certain times. If arrival times 
are known a priori, the scheduling problem becomes NP-hard in general and 
branch and bound techniques are typically used for generating schedules. 

4.2.2.2 Scheduling with precedence constraints

We start with a task graph reflecting task dependences (see fig. 4.8). Task T3
can be executed only after tasks T1 and T2 have completed and sent messages
to T3.

This figure also shows a legal schedule. For static scheduling, this schedule 
can be stored in a table, indicating to the dispatcher the times at which tasks 
must be started and at which messages must be exchanged. 

Figure 4.8. Precedence graph and schedule

An optimal algorithm for minimizing the maximum lateness for the case of
simultaneous arrival times was presented by Lawler [Lawler, 1973]. The
algorithm is called Latest Deadline First (LDF). LDF is based on a total order
compatible with the partial order described by the task graph. LDF reads the
task graph and, among the tasks with no successors, moves the task with the
latest deadline into a queue. It then repeats this process for the remaining tasks.
If there is just a global time constraint, LDF essentially performs a topological
sort [Sedgewick, 1988]. At run-time, the tasks are executed in the generated
total order. LDF is non-preemptive and is optimal for mono-processors.
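The LDF selection loop can be sketched as follows. The function name, the dictionary-based graph representation and the sample deadlines are assumptions made for this illustration. In this sketch the task selected first is executed last, which is the usual reading of LDF; the description above does not spell out the direction of the queue, so this should be treated as an assumption as well.

```python
def latest_deadline_first(deadlines, successors):
    """Return an execution order for a task graph using LDF.

    deadlines:  dict mapping task name -> deadline
    successors: dict mapping task name -> set of tasks that must run after it
    """
    remaining = set(deadlines)
    selected = []                      # filled from the last task to the first
    while remaining:
        # Candidates are tasks whose successors have all been selected already.
        candidates = [t for t in remaining
                      if not (successors.get(t, set()) & remaining)]
        if not candidates:
            raise ValueError("task graph contains a cycle")
        # Latest deadline wins; the name breaks ties deterministically.
        latest = max(candidates, key=lambda t: (deadlines[t], t))
        selected.append(latest)
        remaining.remove(latest)
    return selected[::-1]              # execute in reverse selection order


# T3 may only run after T1 and T2; the deadlines are hypothetical.
order = latest_deadline_first(
    deadlines={"T1": 4, "T2": 6, "T3": 9},
    successors={"T1": {"T3"}, "T2": {"T3"}},
)
print(order)
```

For static scheduling, the resulting list could be stored in the dispatcher table mentioned above.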

The case of asynchronous arrival times can be handled with a modified EDF
algorithm. The key idea is to transform the problem from a given set of
dependent tasks into a set of independent tasks with different timing parameters
[Chetto et al., 1990]. This algorithm is again optimal for uni-processor
systems.

If preemption is not allowed, the heuristic algorithm developed by Stankovic 
and Ramamritham [Stankovic and Ramamritham, 1991] can be used. 

4.2.3 Periodic scheduling 

4.2.3.1 Notation

Next, we will consider the case of periodic tasks. For periodic scheduling, the
best that we can do is to design an algorithm which will always find a schedule
if one exists. A scheduler is defined to be optimal iff it will find a schedule if
one exists.

Let {Ti} be a set of tasks. Each execution of some task Ti is called a job. The
execution time for each job corresponding to one task is assumed to be the
same. Let (see fig. 4.9)

■ Pi be the period of task Ti,

■ Ci be the execution time of Ti,

■ di be the deadline interval, that is, the time between a job of Ti becoming
available and the time after which the same job has to finish execution.

■ li be the laxity or slack, defined as

$$l_i = d_i - C_i \qquad (4.2)$$

Figure 4.9. Notation used for time intervals

If li = 0, then Ti has to be started immediately after it becomes executable.

Let μ denote the accumulated utilization for a set of n processes, that is, the
accumulated execution times of these processes divided by their period:

$$\mu = \sum_{i=1}^{n} \frac{C_i}{P_i} \qquad (4.3)$$

Let us assume that the execution times are equal for a number of m
processors. Obviously, equation 4.4 represents a necessary condition for a
schedule to exist:

$$\mu \le m \qquad (4.4)$$
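The accumulated utilization and the necessary condition of equation 4.4 can be checked directly. The following sketch uses exact fractions to avoid rounding effects; the function names and the sample task sets are assumptions made for illustration.

```python
from fractions import Fraction


def utilization(tasks):
    """Accumulated utilization of (execution_time, period) pairs."""
    return sum(Fraction(c) / Fraction(p) for c, p in tasks)


def passes_necessary_condition(tasks, processors=1):
    """Equation 4.4: the utilization must not exceed the processor count."""
    return utilization(tasks) <= processors


two_tasks = [(3, 5), (3, 8)]            # (execution time, period)
print(utilization(two_tasks))           # exact value as a fraction
print(passes_necessary_condition(two_tasks))
print(passes_necessary_condition([(3, 4)] * 3, processors=2))
```

Passing this check does not guarantee that a schedule exists; failing it guarantees that none does.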

4.2.3.2 Independent tasks

Initially, we will restrict ourselves to a description of the case in which tasks 
are independent. 

Rate monotonic scheduling. Rate monotonic (RM) scheduling [Liu 
and Layland, 1973] is probably the most well-known scheduling algorithm for 
independent periodic processes. Rate monotonic scheduling is based on the 
following assumptions (“RM assumptions”): 

1 All tasks that have hard deadlines are periodic.

2 All tasks are independent.

3 di = Pi, for all tasks.

4 Ci is constant and is known for all tasks.

5 The time required for context switching is negligible.

6 For a single processor and for n tasks, the following equation holds for the
accumulated utilization μ:

$$\mu = \sum_{i=1}^{n} \frac{C_i}{P_i} \le n\,(2^{1/n} - 1) \qquad (4.5)$$

Fig. 4.10 shows the right hand side of equation 4.5. 

Figure 4.10. Right hand side of equation 4.5

The right hand side is about 0.7 for large n:

$$\lim_{n \to \infty} n\,(2^{1/n} - 1) = \ln(2) \quad (\approx 0.7) \qquad (4.6)$$
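The bound on the right hand side of equation 4.5 is easy to evaluate numerically. The short sketch below (the function name `rm_bound` is an assumption) tabulates it for a few values of n and compares it with the limit ln(2).

```python
import math


def rm_bound(n: int) -> float:
    """Right hand side of equation 4.5 for n tasks."""
    return n * (2 ** (1.0 / n) - 1)


bounds = {n: round(rm_bound(n), 3) for n in (1, 2, 3, 10)}
for n, value in bounds.items():
    print(n, value)
print("limit:", round(math.log(2), 3))
```

The bound decreases monotonically from 1 for a single task towards ln(2), so a task set with a utilization of at most ln(2) satisfies equation 4.5 for any number of tasks.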

Then, according to the policy for rate monotonic scheduling, the priority of 
tasks is a monotonically decreasing function of their period. In other words, 
tasks with a short period will get a high priority and tasks with a long period 
will be assigned a low priority. RM scheduling is a preemptive scheduling 
policy with fixed priorities. It is possible to prove that rate monotonic schedul- 
ing is optimal for mono-processor systems. Equation 4.5 requires that some of 
the computing power of the processor is not used in order to make sure that all 
requests are honored in time. 

Fig. 4.11 shows an example of a schedule generated with RM scheduling. 

Vertical bars indicate the arrival time of the tasks. Tasks 1 to 3 have a period
of 2, 6 and 6, respectively. Execution times are 0.5, 2, and 1. Task 1 has
the shortest period and, hence, the highest rate and priority. Each time task 1
becomes available, its jobs preempt the currently active task. Task 2 has the
same period as task 3, and neither of them preempts the other.

Figure 4.11. Example of a schedule generated with RM scheduling

RM scheduling has the following important advantages:

■ RM scheduling is based on static priorities. This simplifies the OS and
opens opportunities for using RM scheduling in a standard operating
system providing fixed priorities, such as Windows NT (see Ramamritham
[Ramamritham et al., 1998], [Ramamritham, 2002]).

■ If the above six RM-assumptions (see page 136) are met, all deadlines will 
be met (see Buttazzo [Buttazzo, 2002]). 

RM scheduling is also the basis for a number of formal proofs of schedulability. 

Fig. 4.12 shows a case for which not enough idle time is available to guarantee
schedulability for RM scheduling. One task has a period of 5 and an
execution time of 3, whereas the second task has a period of 8 and an execution
time of 3. Task T2 is preempted several times.

Figure 4.12. RM schedule does not meet deadline at time 8

For this particular case we have $\mu = \frac{3}{5} + \frac{3}{8} = \frac{39}{40}$,
which is 0.975. On the other hand, $2\,(2^{1/2} - 1)$ is about 0.828. Hence,
schedulability is not guaranteed for RM scheduling and, in fact, the deadline is
missed at time 8. We assume that the missing computations are not scheduled in
the next period.
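The missed deadline can be reproduced with a small simulation. The following is a sketch under simplifying assumptions: time advances in unit steps, deadlines equal periods, and unfinished work of a job is dropped at its deadline, as assumed above. The function name `simulate_rm` and the tuple layout are choices made for this illustration.

```python
def simulate_rm(tasks, horizon):
    """Simulate preemptive rate monotonic scheduling in unit time steps.

    tasks:   list of (name, period, execution_time); deadline = period
    horizon: number of time units to simulate
    Returns (timeline, misses). timeline[t] is the task run in [t, t+1)
    or None if idle; misses lists (task name, time of missed deadline).
    """
    remaining = {name: 0 for name, _, _ in tasks}
    timeline, misses = [], []
    for t in range(horizon):
        for name, period, exec_time in tasks:
            if t % period == 0:
                if remaining[name] > 0:      # previous job did not finish
                    misses.append((name, t))
                remaining[name] = exec_time  # unfinished work is dropped
        ready = [task for task in tasks if remaining[task[0]] > 0]
        if ready:
            # Fixed priorities: the shortest period has the highest priority.
            name = min(ready, key=lambda task: task[1])[0]
            remaining[name] -= 1
            timeline.append(name)
        else:
            timeline.append(None)
    return timeline, misses


timeline, misses = simulate_rm([("T1", 5, 3), ("T2", 8, 3)], horizon=40)
print(timeline[:8])
print(misses)
```

In this run T2 receives only two time units before time 8, because T1 preempts it at time 5, so the first job of T2 misses its deadline.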

However, this idle time or spare capacity of the processor is not always
required. It is possible to show that RM scheduling is also optimal if, instead
of equation (4.5), we have

$$\mu \le 1 \qquad (4.7)$$

provided that the period of all tasks is a multiple of the period of the highest
priority task.

Equations 4.5 and 4.7 provide easy means to check sufficient conditions for
schedulability.



Earliest deadline first scheduling. EDF can also be applied to periodic
task sets. EDF can be extended to handle the case when deadlines are
different from the periods.

It follows from the optimality of EDF for non-periodic schedules that EDF is
also optimal for periodic schedules. No additional constraints have to be met
to guarantee optimality. This implies that EDF is optimal also for the case
of μ = 1. Accordingly, no deadline is missed if the example of fig. 4.12 is
scheduled with EDF (see fig. 4.13). At time 5, the behavior is different from
that of RM-scheduling: due to the earlier deadline of T2, it is not preempted.



Figure 4.13. EDF generated schedule for the example of fig. 4.12

Since EDF uses dynamic priorities, it cannot be used with a standard operating
system providing only fixed priorities.
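The behavior of EDF on the two-task example (periods 5 and 8, execution time 3 each) can be checked with a unit-step simulation. This is a sketch under the assumptions that deadlines equal periods and that ties between equal deadlines are broken by task name; the function name `simulate_edf` is hypothetical.

```python
def simulate_edf(tasks, horizon):
    """Simulate preemptive EDF scheduling in unit time steps.

    tasks: list of (name, period, execution_time); deadline = period
    Returns (timeline, misses) with the same meaning as for a Gantt chart:
    timeline[t] is the task run in [t, t+1) or None if idle.
    """
    jobs = {}          # name -> [absolute deadline, remaining execution time]
    timeline, misses = [], []
    for t in range(horizon):
        for name, period, exec_time in tasks:
            if t % period == 0:
                if name in jobs and jobs[name][1] > 0:
                    misses.append((name, t))
                jobs[name] = [t + period, exec_time]
        ready = [(deadline, name)
                 for name, (deadline, left) in jobs.items() if left > 0]
        if ready:
            _, name = min(ready)       # earliest absolute deadline first
            jobs[name][1] -= 1
            timeline.append(name)
        else:
            timeline.append(None)
    return timeline, misses


timeline, misses = simulate_edf([("T1", 5, 3), ("T2", 8, 3)], horizon=40)
print(timeline[:10])
print(misses)
```

At time 5 the new job of T1 has deadline 10 while the running job of T2 has deadline 8, so T2 keeps the processor and finishes in time.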



4.2.3.3 Dependent tasks

Scheduling dependent tasks is more difficult than scheduling independent tasks.
The problem of deciding whether or not a schedule exists for a given set of
dependent tasks and a given deadline is NP-complete [Garey and Johnson, 1979].
In order to reduce the scheduling effort, different strategies are used:

■ adding additional resources such that scheduling becomes easier, and

■ partitioning of scheduling into static and dynamic parts. With this
approach, as many decisions as possible are taken at design time and only
a minimum of decisions is left for run-time.

4.2.3.4 Sporadic events

We could connect sporadic events to interrupts and execute them immediately 
if their interrupt priority is the highest in the system. However, quite unpre- 
dictable timing behavior would result for all the other tasks. Therefore, special 
sporadic task servers are used which execute at regular intervals and check 
for ready sporadic tasks. This way, sporadic tasks are essentially turned into 
periodic tasks, thereby improving the predictability of the whole system. 
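One simple way to realize such a server is a polling server. The sketch below is an illustration under several assumptions that are not stated in the text: the server runs at the start of each of its periods (as if it had the highest priority), it may use at most `budget` time units per activation with `budget` not exceeding `period`, and requests are served in order of arrival. All names are hypothetical.

```python
from collections import deque


def polling_server(requests, period, budget, horizon):
    """Serve sporadic requests with a periodic server.

    requests: list of (arrival_time, execution_time, name)
    period:   the server is activated at times 0, period, 2*period, ...
    budget:   maximum execution time the server may use per activation
    Returns a dict mapping request name -> completion time.
    """
    not_arrived = deque(sorted(requests))
    pending = deque()                  # entries: [name, remaining time]
    completed = {}
    for start in range(0, horizon, period):
        # Check for sporadic requests that became ready since the last poll.
        while not_arrived and not_arrived[0][0] <= start:
            _, length, name = not_arrived.popleft()
            pending.append([name, length])
        now, left = start, budget
        while pending and left > 0:
            used = min(pending[0][1], left)
            now += used
            left -= used
            pending[0][1] -= used
            if pending[0][1] == 0:
                completed[pending.popleft()[0]] = now
    return completed


done = polling_server([(1, 1, "A"), (2, 3, "B")], period=5, budget=2, horizon=20)
print(done)
```

From the point of view of the other tasks, the server looks like one more periodic task with execution time `budget`, which is what makes the overall timing predictable. The price is a longer response time for the sporadic requests.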

4.2.4 Resource access protocols 

4.2.4.1 Priority inversion

There are cases in which tasks must be granted exclusive access to resources 
such as global shared variables or devices in order to avoid non-deterministic 
or otherwise unwanted program behavior. Program sections during which such 
exclusive access is required are called critical sections. Operating systems 
typically provide primitives for requesting and releasing exclusive access to 
resources, also called mutex primitives. Tasks not being granted exclusive 
access have to wait until the resource is released. Accordingly, the release 
operation has to check for waiting tasks and resume the task of highest priority. 
We will call the request operation P(S) and the release operation V(S), where 
S corresponds to the particular resource requested. Critical sections should be 
short. 

For tasks with critical sections, there is a crucial effect called priority
inversion. An example of priority inversion is shown in fig. 4.14. We assume that
the priority of task T1 is higher than that of task T2.

Figure 4.14. Priority inversion for two tasks

At time t0, task T2 enters a critical section after requesting exclusive access
to some resource via an operation P. At time t1, task T1 becomes ready and
preempts T2. At time t2, T1 fails getting exclusive access to the resource in
use by T2 and becomes blocked. Task T2 resumes and after some time releases
the resource. The release operation checks for pending tasks of higher priority
and preempts T2. During the time T1 has been blocked, a lower priority task
has effectively blocked a higher priority task. This effect is called priority
inversion.

In the general case, priority inversion exists if some lower priority task is ef- 
fectively preventing a higher priority task from executing due to the exclusive 
use of some resource. The necessity of providing exclusive access to some 
resources is the main reason for the priority inversion effect. 

In the particular case of figure 4.14, the duration of the blocking cannot exceed 
the length of the critical section of T2. Unfortunately, there is no such upper 
bound in the general case. This can be seen from fig. 4.15. 



Figure 4.15. Priority inversion with potentially large delay

We assume that tasks T1, T2 and T3 are given. T1 has the highest priority, T2
has a medium priority and T3 has the lowest priority. Furthermore, we assume
that T1 and T3 require exclusive use of some resource via operation P(S). Now,
let T3 be in its critical section when it is preempted by T2. When T1 preempts
T2 and tries to use the same resource to which T3 has exclusive access, it
blocks and lets T2 continue. As long as T2 is continuing, T3 cannot release the
resource. Hence, T2 is effectively blocking T1 even though the priority of T1
is higher than that of T2. In this example, priority inversion continues as long
as T2 executes. Hence, the duration of the priority inversion situation is not
bounded by the length of any critical section.

One of the most prominent cases of priority inversion happened in the Mars
Pathfinder, where the exclusive use of a shared memory area led to priority
inversion on Mars [Jones, 1997].

4.2.4.2 Priority inheritance

One way of dealing with priority inversion is to use the priority inheritance
protocol. This protocol works as follows:

■ Tasks are scheduled according to their active priorities. Tasks with the same
priorities are scheduled on a first-come, first-served basis.

■ When a task T1 executes P(S) and exclusive access is already granted to
some other task T2, then T1 will become blocked. If the priority of T2 is
lower than that of T1, T2 inherits the priority of T1. Hence, T2 resumes
execution. In general, a task inherits the highest priority of the tasks
blocked by it.

■ When a task T2 executes V(S), its priority is decreased to the highest
priority of the tasks blocked by it. If no other task is blocked by T2, its
priority is reset to the original value. Furthermore, the highest priority task
so far blocked on S is resumed.

■ Priority inheritance is transitive: if T1 blocks T0 and T2 blocks T1, then T2
inherits the priority of T0.
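The effect of the protocol can be illustrated with a small unit-step simulation of three tasks sharing one resource S, in the spirit of the scenario of fig. 4.15. Everything in this sketch is an assumption made for illustration: the task programs, release times and priorities are invented, P and V are modeled as taking no time, and only a single resource is handled, so transitive inheritance does not arise.

```python
def simulate(tasks, inheritance, horizon=40):
    """Return the completion time of each task.

    tasks: dict name -> (base_priority, release_time, program). A program is
    a list of steps: "P" (request S), "V" (release S) or "run" (one time
    unit of computation). Larger numbers mean higher priority.
    """
    pc = {name: 0 for name in tasks}   # index of the next step of each task
    owner = None                       # task currently holding resource S
    blocked = set()                    # tasks waiting for S
    finish = {}                        # task name -> completion time

    def priority(name):
        prio = tasks[name][0]
        if inheritance and name == owner:
            # The owner inherits the highest priority of the tasks it blocks.
            prio = max([prio] + [tasks[b][0] for b in blocked])
        return prio

    for t in range(horizon):
        while True:                    # handle zero-time P/V steps first
            ready = [n for n in tasks if tasks[n][1] <= t
                     and n not in blocked and n not in finish]
            if not ready:
                break
            current = max(ready, key=priority)
            program = tasks[current][2]
            step = program[pc[current]]
            if step == "P" and owner is not None:
                blocked.add(current)   # S is taken: block and retry later
                continue
            if step == "P":
                owner = current
            elif step == "V":
                owner = None
                blocked.clear()        # waiting tasks become ready again
            pc[current] += 1
            if pc[current] == len(program):
                finish[current] = t + 1 if step == "run" else t
            if step == "run":
                break                  # one time unit has been consumed
    return finish


tasks = {
    "T1": (3, 3, ["run", "P", "run", "V", "run"]),           # high priority
    "T2": (2, 2, ["run"] * 5),                               # medium priority
    "T3": (1, 0, ["P", "run", "run", "run", "V", "run"]),    # low priority
}
without = simulate(tasks, inheritance=False)
with_pi = simulate(tasks, inheritance=True)
print(without["T1"], with_pi["T1"])
```

Without inheritance, T1 has to wait until the medium priority task T2 has finished all of its computation. With inheritance, T3 runs at the priority of T1 while it holds S, so T1 is delayed only by the rest of the critical section of T3.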

In the example of fig. 4.15, T3 would inherit the priority of T1 when T1
executes P(S). This would avoid the problem mentioned since T2 could not
preempt T3 (see fig. 4.16).

Figure 4.16. Priority inheritance for the example of fig. 4.15

Priority inheritance is also used by ADA: during a rendez-vous, the priority of
both tasks is set to their maximum.

Priority inheritance also solved the Mars Pathfinder problem: the VxWorks
operating system used in the Pathfinder implements a flag for the calls to mutex
primitives. This flag allows priority inheritance to be set to “on”. When the
software was shipped, it was set to “off”. The problem on Mars was corrected
by using the debugging facilities of VxWorks to change the flag to “on”, while
the Pathfinder was already on Mars [Jones, 1997].

While priority inheritance solves some problems, it does not solve others.
There may be a large number of tasks having a high priority and there may
even be deadlocks. These problems are avoided with the more complex
priority ceiling protocol [Sha et al., 1990].

4.3 Embedded operating systems 
4.3.1 General requirements 

Except for very simple systems, scheduling, task switching, and I/O require
the support of an operating system suited for embedded applications. Task
switch (or task “dispatch”) algorithms multiplex processors such that each task
seems to have its own (virtual) processor. The following are essential features
of real-time and embedded operating systems:

■ Due to the large variety of embedded systems, there is also a large variety of
requirements for the functionality of embedded OSs. Due to efficiency
requirements, it is not possible to work with OSs which provide the union of
all functionalities. Hence, we need operating systems which can be flexibly
tailored towards the application at hand. Configurability is therefore one
of the main characteristics of embedded OSs. Configurability in its simplest
form might just remove unused functions (to some extent, this can be done
by a linker). In a more sophisticated form, conditional compilation can
be employed (taking advantage of #if and #ifdef preprocessor commands).
Dynamic data might be replaced by static data. Advanced compile-time
evaluation and advanced compiler optimizations may also be useful in this
context. Object-orientation could lead to a derivation of proper subclasses.
Verification is a potential problem of systems with a large number of
derived tailored OSs. Each and every derived OS must be tested thoroughly.
Takada mentions this as a potential problem for eCos (an open source RTOS
from Red Hat), comprising 100 to 200 configuration points [Takada, 2001].

■ There is a large variety of peripheral devices employed in embedded sys- 
tems. Many embedded systems do not have a hard disc, a keyboard, a 
screen or a mouse. There is effectively no device that needs to be sup- 
ported by all versions of the OS, except maybe the system timer. Hence, 
it makes sense to handle relatively slow devices such as discs and networks 
by using special tasks instead of integrating their drivers into the kernel of 
the OS. 

■ Protection mechanisms are not always necessary, since embedded
systems are typically designed for a single purpose and untested programs are
hardly ever loaded. After the software has been tested, it can be assumed
to be reliable (protection mechanisms may nevertheless still be needed for
safety and security reasons). In most cases, embedded systems do not have
protection mechanisms. This also applies to input/output. In contrast to
desktop applications, there is no desire to implement I/O instructions as
privileged instructions and tasks can be allowed to do their own I/O. This
matches nicely with the previous item and reduces the overhead of I/O
operations.

Example: Let switch correspond to the (memory-mapped) I/O address of 
some switch which needs to be checked by some program. We can simply 
use a 

load register, switch 

instruction to query the switch. There is no need to go through an OS 
service call, which would create a lot of overhead for saving and restoring 
the task context (registers etc.). 

■ Interrupts can be employed by any process. For desktop applications,
it would be a serious source of unreliability to allow any process to use
interrupts directly. Since embedded programs can be considered to be
thoroughly tested, since protection is not necessary and since efficient control
over a variety of devices is required, it is possible to let interrupts directly
start or stop tasks (e.g. by storing the task’s start address in the interrupt
vector address table). This is substantially more efficient than going through
OS services for the same purpose. However, composability may suffer from
this: if a specific task is directly connected to some interrupt, then it may be
difficult to add another task which also needs to be started by some event.

■ Many embedded systems are real-time (RT) systems and, hence, the OS
used in these systems must be a real-time operating system (RTOS).

4.3.2 Real-time operating systems 

Def.: A real-time operating system is an operating system that supports the
construction of real-time systems [Takada, 2001].

What does it take to make an OS an RTOS? The following are the three key
requirements (this section includes information from Hiroaki Takada’s tutorial
[Takada, 2001] on real-time operating systems at the Asia and South Pacific
Design Automation Conference (ASP-DAC) in 2001):

■ The timing behavior of the OS must be predictable. For each service
of the OS, an upper bound on the execution time must be guaranteed. In
practice, there are various levels of predictability. For example, there may
be sets of OS service calls for which an upper bound is known and for
which there is not a significant variation of the execution time. Calls like
“get me the time of the day” may fall into this class. For other calls, there
may be a huge variation. Calls like “get me 4MB of free memory” may
fall into this second class. In particular, the scheduling policy of RTOSs
must be deterministic (standard Java fails badly in this respect, as no order
of execution for a number of executable “threads” is specified). As another
special case we mention garbage collection. In the Java context, various
attempts have been made to provide predictable garbage collection (see
page 58).

There may also be times during which interrupts have to be disabled to 
avoid interference between tasks (this is a very simple way of guaranteeing 
mutual exclusion on a mono-processor system). The periods during which 
interrupts are disabled have to be quite short in order to avoid unpredictable 
delays in the processing of critical events. 

For RTOSs implementing file systems, it may be necessary to implement
contiguous files (files stored in contiguous disc areas) to avoid unpredictable
disc head movements.

■ The OS must manage the timing and scheduling of tasks. Scheduling
can be defined as mapping from the set of tasks to intervals of execution
time, including the mapping to start times as a special case. Also, the OS
possibly has to be aware of task deadlines so that the OS can apply
appropriate scheduling techniques (there are, however, cases in which scheduling
is completely done off-line and the OS only needs to provide services to start
tasks at specific times or priority levels).

The OS must provide precise time services with a high resolution. Time
services are required, for example, in order to distinguish between original
and subsequent errors. For example, they can help to identify the power
plant(s) that are responsible for a blackout such as the one on America’s
East Coast in 2003. Time services and global synchronization of clocks are
described in detail in the book by Kopetz [Kopetz, 1997].

■ The OS must be fast. In addition to being predictable, the OS must be 
capable of supporting applications with deadlines that are fractions of a 
second. 

Each RTOS includes a so-called real-time OS kernel. This kernel manages the 
resources which are found in every system, including the processor, the mem- 
ory and the system timer. Protection mechanisms (except for dependability, 
safety or privacy reasons) need not be present. 

There are two types of RTOSs:

■ General purpose OS type RTOSs: for these operating systems, some
drivers, such as disk, network or audio drivers, are implicitly assumed
to be present, and they are embedded into the kernel. The application
software and middleware are implemented on top of the application
programming interface, which is standard for all applications (see fig. 4.17).

[Figure: layer diagrams; left: application software, middleware and device
drivers on top of a real-time kernel; right: application software on top of
an operating system]

Figure 4.17. Real-time kernel (left) vs. general purpose OS (right)

■ Real-time kernel type of RTOSs: since there is hardly any standard de- 
vice in embedded systems, device drivers are not deeply embedded into 
the kernel, but are implemented on top of the kernel. Only the necessary 
drivers are included. Applications and middleware may be implemented on 
top of appropriate drivers, not on top of a standardized API of the OS. 

Major functions in the kernel include the task management, inter-task
synchronization and communication, time management and memory management.

While some RTOSs are designed for general embedded applications, others
focus on a specific area. For example, OSEK/VDX OS focuses on automotive
control. Due to this focus, it is a rather compact OS.

Similarly, while some RTOSs provide a standard API, others come with their 
own, proprietary API. For example, some RTOSs are compliant with the POSIX 
RT-extension [Harbour, 1993] for UNIX, with OSEK/VDX OS, or with the 
ITRON specification developed in Japan. Many RT-kernel type of OSs have 
their own API. ITRON, mentioned in this context, is a mature RTOS which 
employs link-time configuration. 

Currently available RTOSs can further be classified into the following three
categories [Gupta, 1998]:

■ Fast proprietary kernels: According to Gupta, for complex systems, these
kernels are inadequate, because they are designed to be fast, rather than
to be predictable in every respect. Examples include QNX, PDOS, VCOS,
VRTX32, VxWORKS.

■ Real-time extensions to standard OSs: In order to take advantage of
comfortable main stream operating systems, hybrid systems have been
developed. For such systems, there is an RT-kernel running all RT-tasks. The
standard operating system is then executed as one of the tasks (see fig.
4.18).

[Figure: RT-task 1, RT-task 2 and the standard OS (with non-RT tasks 1 and 2)
on top of device drivers and the real-time kernel]

Figure 4.18. Hybrid OSs

This approach has some advantages: the system can be equipped with a
standard OS API, can have graphical user interfaces (GUIs), file systems
etc. and enhancements to standard OSs become quickly available in the
embedded world as well. Also, problems with the standard OS and its non-
RT tasks do not negatively affect the RT-tasks. The standard OS can even
crash and this would not affect the RT-tasks. On the downside, and this is
already visible from fig. 4.18, there may be problems with device drivers,
since the standard OS will have its own device drivers. In order to avoid
interference between the drivers for RT-tasks and those for the other tasks,
it may be necessary to partition devices into those handled by RT-tasks and
those handled by the standard OS. Also, RT-tasks cannot use the services of
the standard OS. So all the nice features like file-system access and GUIs
are normally not available to those tasks, even though some attempts may
be made to bridge the gap between the two types of tasks without losing
the RT-capability. RT-Linux is an example of such hybrid OSs.

According to Gupta, trying to use a version of a standard OS is not the 
correct approach because too many basic and inappropriate underlying 
assumptions still exist such as optimizing for the average case (rather than 
the worst case), ... ignoring most if not all semantic information, and in- 
dependent CPU scheduling and resource allocation. Indeed, dependences 
between tasks are not very frequent for most applications of standard oper- 
ating systems and are therefore frequently ignored by such systems. This 
situation is different for embedded systems, since dependences between 
tasks are quite common and they should be taken into account. Unfortu- 
nately, this is not always done if extensions to standard operating systems 
are used. Furthermore, resource allocation and scheduling are rarely com- 
bined for standard operating systems. However, integrated resource alloca- 
tion and scheduling algorithms are required in order to guarantee meeting 
timing constraints. 

■ There is a number of research systems which aim at avoiding the above
limitations. These include Melody [Wedde and Lind, 1998], and (according
to Gupta [Gupta, 1998]) MARS, Spring, MARUTI, Arts, Hartos, and
DARK.

Takada [Takada, 2001] mentions low overhead memory protection, temporal
protection of computing resources (how to prevent tasks from computing for
longer periods of time than initially planned), RTOSs for on-chip
multiprocessors (especially for heterogeneous multiprocessors and multi-threaded
processors) and support for continuous media and quality of service control as
research issues.

Due to the potential growth in the embedded system market, vendors of
standard OSs are actively trying to sell variations of their products (like
Embedded Windows XP and Windows CE [Microsoft Inc., 2003]) and obtain market
shares from traditional vendors such as Wind River Systems [Wind River
Systems, 2003].

4.4 Middleware 

4.4.1 Real-time data bases 

Data bases provide a convenient and structured way of storing and accessing
information. Accordingly, data bases provide an API for writing and reading
information. A sequence of read and write operations is called a transaction.
Transactions may have to be aborted for a variety of reasons: there could be
hardware problems, deadlocks, problems with concurrency control etc. A frequent
requirement is that transactions do not affect the state of the data base unless
they have been executed to their very end. Hence, changes caused by
transactions are normally not considered to be final until they have been
committed. Most transactions are required to be atomic. This means that the end
result (the new state of the data base) generated by some transaction must be the
same as if the transaction had been either fully completed or not executed at
all. Also, the data base state resulting from a transaction must be consistent.
Consistency requirements include, for example, that the values from read
requests belonging to the same transaction are consistent (do not describe a
state which never existed in the environment modeled by the data base).
Furthermore, to some other user of the data base, no intermediate state resulting
from a partial execution of a transaction must be visible (the transactions must
be performed as if they were executed in isolation). Finally, the results of
transactions should be persistent. This property is also called durability of
the transactions. Together, the four properties (atomicity, consistency,
isolation and durability) are known as ACID properties (see the book by Krishna
and Shin [Krishna and Shin, 1997], chapter 5).
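The difference between tentative and committed changes can be illustrated with a toy in-memory store. This sketch only shows atomicity and the invisibility of uncommitted writes for a single transaction at a time; it provides neither concurrency control nor durability, and the class and method names are assumptions rather than the API of any real data base.

```python
class Transaction:
    """Collects writes privately and applies them only on commit."""

    def __init__(self, store: dict):
        self._store = store
        self._writes = {}

    def read(self, key):
        # A transaction sees its own tentative writes first.
        return self._writes.get(key, self._store.get(key))

    def write(self, key, value):
        self._writes[key] = value      # not visible to other users yet

    def commit(self):
        self._store.update(self._writes)
        self._writes = {}

    def abort(self):
        self._writes = {}              # the store is left untouched


store = {"seats": 3}

first = Transaction(store)
first.write("seats", first.read("seats") - 1)
seats_seen_by_others = store["seats"]  # still the old value
first.abort()
after_abort = dict(store)

second = Transaction(store)
second.write("seats", second.read("seats") - 1)
second.commit()
print(seats_seen_by_others, after_abort, store)
```

An aborted transaction leaves no trace in the store, while a committed one applies all of its writes together.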

For some data bases, there are soft real-time constraints. For example,
time-constraints for airline reservation systems are soft. In contrast, there may
also be hard constraints. For example, automatic recognition of pedestrians in
automobile applications and target recognition in military applications must meet
hard real-time constraints. The above requirements make it very difficult to
guarantee hard real-time constraints. For example, transactions may be aborted
various times before they are finally committed. For all data bases relying on
demand paging and on hard discs, the access times to discs are hardly
predictable. Possible solutions include main memory data bases. Embedded data
bases are sometimes small enough to make this approach feasible. In other
cases, it may be possible to relax the ACID requirements. For further
information, see the book by Krishna and Shin.

4.4.2 Access to remote objects 

There are special software packages which facilitate the access to remote
services. CORBA® (Common Object Request Broker Architecture) [Object
Management Group (OMG), 2003] is one example of this. With CORBA, remote
objects can be accessed through standardized interfaces. Clients communicate
with local stubs, imitating the access to the remote objects. These
clients send information about the object to be accessed as well as parameters
(if any) to the Object Request Broker ORB (see fig. 4.19). The ORB then
determines the location of the object to be accessed and sends information via
a standardized protocol, e.g. the IIOP protocol, to where the object is located.
This information is then forwarded to the object via a skeleton and the
information requested from the object (if any) is returned using the ORB again.

Figure 4.19. Access to remote objects using CORBA

Standard CORBA does not provide the predictability required for real-time 
applications. Therefore, a separate real-time CORBA (RT-CORBA) standard 
has been defined [Object Management Group (OMG), 2002]. A very essential 
feature of RT-CORBA is to provide end-to-end predictability of timeliness in a 
fixed priority system. This involves respecting thread priorities between client 
and server for resolving resource contention, and bounding the latencies of op- 
eration invocations. One particular problem of real-time systems is that thread 
priorities might not be respected when threads obtain mutually exclusive ac- 
cess to resources. This so-called priority inversion problem (see page 140) has 
to be addressed in RT-CORBA. RT-CORBA includes provisions for bounding 
the time during which such priority inversion can happen. RT-CORBA also
includes facilities for thread priority management. This priority is independent 
of the priorities of the underlying operating system, even though it is compati- 
ble with the real-time extensions of the POSIX standard for operating systems 
[Harbour, 1993]. The thread priority of clients can be propagated to the server 
side. Priority management is also available for primitives providing mutually 
exclusive access to resources. The priority inheritance protocol (described on 
page 141) must be available in implementations of RT-CORBA. Pools of pre- 
existing threads avoid the overhead of thread creation and thread-construction. 

As an alternative to CORBA, the message passing interface (MPI) can be used
for communicating between different processors. In order to apply the MPI
style of communication to real-time systems, a real-time version of MPI, called
MPI/RT, has been defined [MPI/RT forum, 2001]. MPI/RT does not cover
some of the issues covered in RT-CORBA, such as thread creation and
termination. MPI/RT is conceived as a potential layer between the operating system
and standard (non real-time) MPI.




Chapter 5 



IMPLEMENTING EMBEDDED SYSTEMS: 
HARDWARE/SOFTWARE CODESIGN 



Once the specification has been completed, design activities can start. This is 
consistent with the simplified design information flow (see fig. 5.1). 




Figure 5.1. Simplified design information flow 

It is a characteristic of embedded systems that both hardware and software
have to be considered during their design. Therefore, this type of design is
also called hardware/software codesign. The overall goal is to find the right
combination of hardware and software resulting in the most efficient product
meeting the specification. Therefore, embedded systems cannot be designed
by a synthesis process taking only the behavioral specification into account.
Rather, available components have to be accounted for. There are also other
reasons for this constraint: in order to cope with the increasing complexity of
embedded systems and their stringent time-to-market requirements, reuse is
essentially unavoidable. This led to the term platform-based design:






A platform is a family of architectures satisfying a set of constraints imposed 
to allow the reuse of hardware and software components. However, a hard- 
ware platform is not enough. Quick, reliable, derivative design requires using 
a platform application programming interface (API) to extend the platform to- 
ward application software. In general, a platform is an abstraction layer that 
covers many possible refinements to a lower level. Platform-based design is 
a meet-in-the-middle approach: In the top-down design flow, designers map 
an instance of the upper platform to an instance of the lower, and propagate 
design constraints [Sangiovanni-Vincentelli, 2002]. 

The mapping is an iterative process in which performance evaluation tools
guide the next assignment. Fig. 5.2 [Herrera et al., 2003a] visualizes this
approach.






Figure 5.2. Platform-based design 

Design activities have to take the existence of available platforms into account.
There is actually a large number of design activities, only some of which can be
presented here, and references to available platforms will not always be explicit.
Design activities that are presented include:

■ Task level concurrency management: This activity is concerned with 
identifying the tasks that should be present in the final embedded system. 
These tasks may be different from those that were included in the speci- 
fication, since there are good reasons for merging and splitting tasks (see 
section 5.5.1). 

■ High-level transformations: It has been found that there are many op- 
timizing high-level transformations that can be applied to specifications. 
For example, loops can be interchanged so that accesses to array compo- 
nents become more local. Also, floating point arithmetic can frequently 
be replaced by fixed-point arithmetic without any significant loss in
quality. These high-level transformations are typically beyond the capabilities
of available compilers and have to be applied before any compilation is 
started. 

■ Hardware/software partitioning: We assume that in the general case, 
some function has to be performed by special hardware due to increas- 
ing computational requirements [De Man, 2002]. Hardware/software par- 
titioning is the activity in charge of mapping operations to either hardware 
or software. 

■ Compilation: Those parts of the specification that are mapped to software 
have to be compiled. Efficiency of the generated code is improved if the 
compiler exploits knowledge about the underlying processor (and possi- 
bly the memory) hardware. Therefore, there are special “hardware-aware” 
compilers for embedded systems. 

■ Scheduling: Scheduling (mapping of operations to start times) has to be 
performed in several contexts. Schedules have to be approximated during 
hardware/software partitioning, during task level concurrency management 
and possibly also during compilation. Precise schedules can be obtained 
for the final code.

■ Design space exploration: In most of the cases, several designs meet the
specifications. Design space exploration is the process of analyzing the set
of possible designs. Among those designs that meet the specifications, one
design has to be selected.

Particular design flows may use these activities in different orders. There is no
standard set of design activities. We will briefly mention some orders that are
being used at the end of this chapter (see page 190) in order to provide some
ideas on what actual design flows can look like.

5.1 Task level concurrency management 

As mentioned on page 52, the task graphs’ granularity is one of their most
important properties. Even for hierarchical task graphs, it may be useful to
change the granularity of the nodes. The partitioning of specifications into
tasks or processes does not necessarily aim at the maximum implementation
efficiency. Rather, during the specification phase, a clear separation of
concerns and a clean software model are more important than caring about the
implementation too much. Hence, there will not necessarily be a one-to-one
correspondence between the tasks in the specification and those in the
implementation. This means that a regrouping of tasks may be advisable. Such a
regrouping is indeed feasible by merging and splitting of tasks.

Merging of task graphs can be performed whenever some task Ti is the
immediate predecessor of some other task Tj and if Tj does not have any other
immediate predecessor (see fig. 5.3 with Ti = T3 and Tj = T4). This
transformation can lead to a reduced overhead of context-switches if the node is
implemented in software, and it can lead to a larger potential for optimizations
in general.





Figure 5.3. Merging of tasks
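The merging condition can be checked mechanically on a task graph. The following is a minimal sketch, assuming a small adjacency-matrix representation; the names TaskGraph, can_merge and merge_tasks are hypothetical and are not taken from any tool mentioned in this book.

```c
#include <stdbool.h>

#define MAX_TASKS 8   /* assumed upper bound on the number of tasks */

/* edge[i][j] is true if task i is an immediate predecessor of task j. */
typedef struct {
    int  n;
    bool edge[MAX_TASKS][MAX_TASKS];
} TaskGraph;

/* Merging condition from the text: Ti is an immediate predecessor of Tj
   and Tj has no other immediate predecessor. */
bool can_merge(const TaskGraph *g, int i, int j)
{
    if (i < 0 || j < 0 || i >= g->n || j >= g->n || i == j || !g->edge[i][j])
        return false;
    for (int p = 0; p < g->n; p++)
        if (p != i && g->edge[p][j])
            return false;
    return true;
}

/* Merge Tj into Ti: Ti inherits all successors of Tj and Tj becomes
   an isolated (unused) node. Returns false if merging is not allowed. */
bool merge_tasks(TaskGraph *g, int i, int j)
{
    if (!can_merge(g, i, j))
        return false;
    g->edge[i][j] = false;
    for (int s = 0; s < g->n; s++) {
        if (g->edge[j][s]) {
            g->edge[i][s] = true;
            g->edge[j][s] = false;
        }
    }
    return true;
}
```

For a graph in which T4 has T3 as its only immediate predecessor, can_merge accepts the pair (T3, T4), whereas a node with two immediate predecessors is rejected.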

On the other hand, splitting of tasks may be advantageous for the following 
reasons: 

Tasks may be holding resources (like large amounts of memory) while they
are waiting for some input. In order to maximize the use of these resources, it
may be best to constrain the use of these resources to the time intervals during
which these resources are actually needed. In fig. 5.4, we are assuming that
task T2 requires some input somewhere in its code. In the initial version, the
execution of task T2 can only start if this input is available. We can split the
node into T2 and T2* such that the input is only required for the execution of
T2*. Now, T2 can start earlier, resulting in more scheduling freedom. This
improved scheduling freedom might improve resource utilization and could
even enable meeting some deadline. It may also have an impact on the memory
required for data storage, since T2 could release some of its memory before
terminating and this memory could be used by other tasks while T2* is waiting
for input.




Figure 5.4. Splitting of tasks

One might argue that the tasks should release resources like large amounts
of memory anyway before waiting for input. However, the readability of the 
original specification could suffer from caring about implementation issues in 
an early design phase. 

Quite complex transformations of the specifications can be performed with
a Petri-net based technique described by Cortadella et al. [Cortadella et al.,
2000]. Their technique starts with a specification consisting of a set of tasks
described in a language called FlowC. FlowC extends C with process headers
and intertask communication specified in the form of READ- and WRITE-
function calls. Fig. 5.5 shows an input specification using FlowC.



PROCESS GetData (InPort IN, OutPort DATA){
  float sample, sum; int i;
  while (1) {
    sum=0;
    for (i=0; i<N; i++){
      READ(IN,sample,1);
      sum+=sample;
      WRITE(DATA,sample,1);
    }
    WRITE(DATA,sum/N,1);
}}

PROCESS Filter(InPort DATA, InPort COEF, OutPort OUT){
  float c,d; int j;
  c=1; j=0;
  while(1) {
    SELECT(DATA,COEF){
      case DATA: READ(DATA,d,1);
        if (j==N){j=0; d=d*c; WRITE(OUT,d,1);
        } else j++;
        break;
      case COEF: READ(COEF,c,1); break;
}}}

Figure 5.5. System specification

The example uses input ports IN and COEF, as well as output port OUT. Point-
to-point interprocess communication between processes is realized through a
uni-directional buffered channel DATA. Task GetData reads data from the
environment and sends it to channel DATA. Each time N samples have been sent,
their average value is also sent via the same channel. Task Filter reads N values
from the channel (and ignores them) and then reads the average value,
multiplies the average value by c (c can be read in from port COEF) and writes the
result to port OUT. The third parameter in READ and WRITE calls is the
number of items to be read or written. READ calls are blocking, WRITE calls are
blocking if the number of items in the channel exceeds a predefined threshold.
The SELECT statement has the same semantics as the statement with the same
name in ADA (see page 57): execution of this task is suspended until input
arrives from one of the ports. This example meets all criteria for splitting tasks









that were mentioned in the context of fig. 5.4. Both tasks will be waiting for
input while occupying resources. Efficiency could be improved by
restructuring these tasks. However, the simple splitting of fig. 5.4 is not sufficient. The
technique proposed by Cortadella et al. is a more comprehensive one. Using
their technique, FlowC programs are first translated into (extended) Petri-nets.
Petri-nets for each of the tasks are then merged into a single Petri-net. Using
results from Petri-net theory, new tasks are then generated. Fig. 5.6 shows a
possible new task structure.




Figure 5.6. Generated software tasks

In this new task structure, there is one task which performs all initializations.
In addition, there is one task for each of the input ports. An efficient
implementation would raise interrupts each time new input is received for a port.
There should be a unique interrupt per port. The tasks could then be started
directly by those interrupts, and there would be no need to invoke the operating
system for that. Communication can be implemented as a single shared global
variable (assuming a shared address space for all tasks). The overall operating
system overhead would be quite small, if required at all.

The code for task Tin shown in fig. 5.6 is the one that is generated by the Petri
net-based inter-task optimization of the task structure. It should be further
optimized by intra-task optimizations, since the test performed for the first
if-statement is always false (j is equal to i-1 in this case, and i and j are reset to 0
whenever i becomes equal to N). For the second if-statement, the test is always
true, since this point of control is only reached if i is equal to N and i is equal to
j whenever label L0 is reached. Also, the number of variables can be reduced.
The following is an optimized version of Tin:

Tin () {
    READ (IN, sample, 1);
    sum += sample; i++;
    DATA = sample; d = DATA;
L0: if (i < N) return;
    DATA = sum/N; d = DATA;
    d = d*c; WRITE(OUT,d,1);
    sum = 0; i = 0;
    return;
}

The optimized version of Tin could be generated by a very clever compiler. 
Unfortunately, hardly any of today’s compilers will perform this optimization. 
Nevertheless, the example shows the type of transformations required for gen- 
erating “good” task structures. For more details about the task generation, refer 
to Cortadella et al. [Cortadella et al., 2000].
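To make the behavior of the optimized task concrete, the following is a minimal plain-C sketch of Tin and of the coefficient task, written as functions that would be called once per received input. It is not FlowC and not the output of the tool by Cortadella et al.: the value N = 4, the function names t_in and t_coef, and the helper write_out (standing in for WRITE(OUT,d,1)) are assumptions made for illustration only.

```c
#define N 4   /* samples per averaging window (assumed value) */

/* State shared between the tasks (single address space assumed). */
static float sum = 0.0f;
static float c   = 1.0f;   /* coefficient, updated by the COEF task */
static int   i   = 0;

/* Hypothetical stand-in for port OUT: remembers the last value written. */
static float out_value;
static int   out_count = 0;
static void write_out(float d) { out_value = d; out_count++; }

/* Task started whenever a new coefficient arrives on port COEF. */
void t_coef(float new_c) { c = new_c; }

/* Task started whenever a new sample arrives on port IN.
   Mirrors the optimized Tin: accumulate, and after N samples
   emit the scaled average and reset the state. */
void t_in(float sample)
{
    sum += sample;
    i++;
    if (i < N)
        return;
    write_out((sum / N) * c);
    sum = 0.0f;
    i = 0;
}
```

With c set to 2 and the samples 1, 2, 3, 4, nothing is written for the first three calls; the fourth call writes the scaled average 5 and resets the accumulator.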

Optimizations similar to the one just presented are described in the book by 
Thoen [Thoen and Catthoor, 2000]. A list of IMEC’s publications on task 
concurrency management is available from IMEC’s web site [IMEC Desics 
group, 2003]. 

5.2 High-level optimizations 

There are many high-level optimizations which can potentially improve the 
efficiency of embedded software. 

5.2.1 Floating-point to fixed-point conversion 

Floating-point to fixed-point conversion is a commonly used technique. This
conversion is motivated by the fact that many signal processing standards (such
as MPEG-2 or MPEG-4) are specified in the form of C-programs using floating-
point data types. It is left to the designer to find an efficient implementation of
these standards.

For many signal processing applications, it is possible to replace floating-point
numbers with fixed-point numbers (see page 109). The benefits may be
significant. For example, a reduction of the cycle count by 75% and of the energy
consumption by 76% has been reported for an MPEG-2 video compression
algorithm [Hüls, 2002]. However, some loss of precision is normally incurred.
More precisely, there is a tradeoff between the cost of the implementation and
the quality of the algorithm (evaluated e.g. in terms of the so-called signal-
to-noise ratio (SNR)). For small word-lengths, the quality may be seriously
affected. Consequently, floating-point data types may be replaced by fixed-
point data types, but the quality loss has to be analyzed. This replacement was
initially performed manually. However, it is a very tedious and error-prone
process.

Therefore, researchers have tried to support this replacement with tools. One
of the most well-known tools is FRIDGE (fixed-point programming design
environment) [Willems et al., 1997], [Keding et al., 1998]. FRIDGE tools
have been made available commercially as part of the Synopsys System Studio
tool suite [Synopsys, 2005].

In FRIDGE, the design process starts with an algorithm described in C,
including floating-point numbers. This algorithm is then converted to an algorithm
described in fixed-C. Fixed-C extends C by two fixed-point data types, using
the type definition features of C++. Fixed-C is a subset of C++ and provides
two data types fixed and Fixed. Fixed-point data types can be declared very
much like other variables. The following declaration declares a scalar
variable, a pointer, and an array to be fixed-point data types.

fixed a,*b,c[8] 

Providing parameters of fixed-point data types can (but does not have to) be 
delayed until assignment time: 

a=fixed(5,4,s,wt,*b) 

This assignment sets the word-length parameter of a to 5 bits, the fractional 
word-length to 4 bits, sign to present (s), overflow handling to wrap-around 
(w), and the rounding mode to truncation (t). The parameters for variables that 
are read in an assignment are determined by the assignment(s) to those vari- 
ables. The data type Fixed is similar to fixed, except that a consistency check 
between parameters used in the declaration and those used in the assignment 
is performed. For every assignment to a variable, parameters (including the 
word-length) can be different. This parameter information can be added to the 
original C-program before the application is simulated. Simulation provides 
value ranges for all assignments. Based on that information, FRIDGE adds 
parameter information to all assignments. FRIDGE also infers parameter
information from the context. For example, the maximum value of additions is
considered to be the sum of the arguments. Added parameter information can
be either based on simulations or on worst case considerations. Being based
on simulations, FRIDGE does not necessarily assume the worst case values
that would result from a formal analysis. The resulting C++ program is
simulated again to check for the quality loss. The Synopsys version of FRIDGE uses
SystemC fixed-point data types to express generated data type information.
Accordingly, SystemC can be used for simulating fixed-point data types.
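The effect of the parameters used in the assignment a=fixed(5,4,s,wt,*b) can be illustrated with a small sketch in plain C. This is not fixed-C and not FRIDGE code. The function name quantize_wrap_trunc is hypothetical, and truncation is modeled as rounding toward minus infinity, which corresponds to dropping low-order bits of a two's complement number; this interpretation is an assumption.

```c
#include <stdint.h>

/* Quantize x to a signed fixed-point format with wl bits in total and
   fwl fractional bits, using truncation and wrap-around on overflow.
   Returns the real value that is actually represented.
   Assumes 1 <= wl <= 31 and 0 <= fwl <= 31. */
double quantize_wrap_trunc(double x, int wl, int fwl)
{
    int64_t scale   = (int64_t)1 << fwl;   /* weight of the integer LSB  */
    int64_t modulus = (int64_t)1 << wl;    /* number of representable codes */
    int64_t half    = modulus >> 1;

    double  y      = x * (double)scale;
    int64_t scaled = (int64_t)y;           /* cast truncates toward zero */
    if ((double)scaled > y)
        scaled--;                          /* turn it into a floor       */

    /* Wrap into the range [-2^(wl-1), 2^(wl-1) - 1]. */
    int64_t wrapped = ((scaled + half) % modulus + modulus) % modulus - half;

    return (double)wrapped / (double)scale;
}
```

With wl = 5 and fwl = 4, the representable range is -1 to 0.9375 in steps of 0.0625. The value 0.3 is truncated to 0.25, and the value 1.0 lies just outside the range and wraps around to -1.0, which shows why the quality loss has to be checked by simulation.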

An analysis of the tradeoffs between the additional noise introduced and the
word-length needed was proposed by Shi and Brodersen [Shi and Brodersen,
2003] and also by Menard et al. [Menard and Sentieys, 2002].

5.2.2 Simple loop transformations 

There is a number of loop transformations that can be applied to specifications.
The following is a list of standard loop transformations:

■ Loop permutation: Consider a two-dimensional array. According to the 
C standard [Kernighan and Ritchie, 1988], two-dimensional arrays are laid 
out in memory as shown in fig. 5.7. Adjacent index values of the second 
index are mapped to a contiguous block of locations in memory. This lay- 
out is called row-major order [Muchnick, 1997]. Note that the layout for 
arrays is different for FORTRAN: Adjacent values of the first index are 
mapped to a contiguous block of locations in memory (column major or- 
der). Publications describing optimizations for FORTRAN can therefore 
be confusing. 




Figure 5.7. Memory layout for two-dimensional array p[j][k] in C



For row-major layout, it is usually beneficial to organize loops such that 
the last index corresponds to the innermost loop. A corresponding loop 
permutation is shown in the following example: 

for (k=0; k<=m; k++)            for (j=0; j<=n; j++)
  for (j=0; j<=n; j++)     =>     for (k=0; k<=m; k++)
    p[j][k] = ...                   p[j][k] = ...

Such permutations may have a positive effect on the reuse of array elements 
in the cache, since the next iteration of the loop body will access an
adjacent location in memory. Caches are normally organized such that adjacent
locations can be accessed significantly faster than locations that are further 
away from the previously accessed location. 

■ Loop fusion, loop fission: There may be cases in which two separate
loops can be merged, and there may be cases in which a single loop is split
into two. The following is an example:

for (j=0; j<=n; j++)            for (j=0; j<=n; j++)
  p[j]= ... ;                     {p[j]= ... ;
for (j=0; j<=n; j++)               p[j]= p[j] + ...}
  p[j]= p[j] + ...

The left version may be advantageous if the target processor provides a 
zero-overhead loop instruction which can only be used for small loops. 
The right version might lead to an improved cache behavior (due to the 
improved locality of references to array p), and also increases the potential 
for parallel computations within the loop body. As with many other trans- 
formations, it is difficult to know which of the transformations leads to the 
best code. 

■ Loop unrolling: Loop unrolling is a standard transformation creating sev- 
eral instances of the loop body. The following is an example in which the 
loop is being unrolled once: 

for (j=0; j<=n; j++)            for (j=0; j<=n; j+=2)
  p[j]= ... ;              =>     {p[j]= ... ;
                                   p[j+1]= ...}

The number of copies of the loop body is called the unrolling factor. Unrolling
factors larger than two are possible. The version shown assumes that the
number of iterations (n+1) is even; otherwise, a remaining iteration has to be
handled separately. Unrolling reduces the loop overhead
(fewer branches per execution of the original loop body) and therefore
typically improves the speed. As an extreme case, loops can be completely
unrolled, removing control overhead and branches altogether. However,
unrolling increases code size. Unrolling is normally restricted to loops with
a constant number of iterations.
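Two of the transformations listed above can be written out as compilable C. The following is a minimal sketch under stated assumptions: the array sizes ROWS and COLS, the function names and the loop bodies (a summation and the assignment p[j] = 2*j) are hypothetical placeholders for the "..." used in the examples. The unrolled loop includes an explicit epilogue so that it is also correct for an odd number of iterations.

```c
#define ROWS 4   /* assumed array dimensions */
#define COLS 6

/* Before permutation: the first index varies in the innermost loop,
   so consecutive accesses are COLS elements apart (row-major layout). */
long sum_first_index_inner(int p[ROWS][COLS])
{
    long s = 0;
    for (int k = 0; k < COLS; k++)
        for (int j = 0; j < ROWS; j++)
            s += p[j][k];
    return s;
}

/* After permutation: the last index varies in the innermost loop,
   so consecutive accesses touch adjacent memory locations. */
long sum_last_index_inner(int p[ROWS][COLS])
{
    long s = 0;
    for (int j = 0; j < ROWS; j++)
        for (int k = 0; k < COLS; k++)
            s += p[j][k];
    return s;
}

/* Loop unrolled with unrolling factor 2: sets p[j] = 2*j for 0 <= j <= n.
   The epilogue handles the last element if n+1 is odd. */
void fill_unrolled(int *p, int n)
{
    int j;
    for (j = 0; j + 1 <= n; j += 2) {
        p[j]     = 2 * j;
        p[j + 1] = 2 * (j + 1);
    }
    if (j <= n)
        p[j] = 2 * j;
}
```

Both summation functions return the same value; only the order of memory accesses differs. Whether the permuted version is actually faster depends on the cache of the target processor.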

5.2.3 Loop tiling/blocking 

It can be observed that the speed of memories is increasing at a slower rate 
than that of processors. Since small memories are faster than large memories 
(see page 118), the use of memory hierarchies may be beneficial. Possible 
“small” memories include caches and scratch-pad memories. A significant 
reuse factor for the information in those memories is required. Otherwise the 
memory hierarchy cannot be efficiently exploited. 
Reuse effects can be demonstrated by an analysis of the following example.
Let us consider matrix multiplication for arrays of size N x N [Lam et al.,
1991]:

for (i=1; i<=N; i++)
  for (k=1; k<=N; k++){
    r=X[i,k]; /* to be allocated to a register*/
    for (j=1; j<=N; j++)
      Z[i,j] += r* Y[k,j]
  }

Let us consider access patterns for this code. The same element X[i,k] is used
by all iterations of the innermost loop. Compilers will typically be capable of
allocating this element to a register and reusing it for every execution of the
innermost loop. We assume that array elements are allocated in row major order
(as is standard for C). This means that array elements with adjacent column
(rightmost) index values are stored in adjacent memory locations. Accordingly,
adjacent locations of Z and Y are fetched during the iterations of the innermost
loop. This property is beneficial if the memory system uses prefetching
(whenever a word is loaded into the cache, loading of the next word is started as well).
Fig. 5.8 shows access patterns for this code.



Figure 5.8. Access pattern for unblocked matrix multiplication

For one iteration of the innermost loop, the black areas of arrays Z and Y are 
accessed (and loaded into the cache). Whether or not the same information is 
still in the cache for the next iteration of the middle or outermost loops depends 
on the size of the cache. In the worst case (if N is large or the cache is small), 
the information has to be reloaded for every execution of the innermost loop 
and cache elements are not reused. The total number of memory references 
may be as large as 2N^3 (for references to Z and Y) + N^2 (for references to X).

Research on scientific computing led to the design of blocked or tiled
algorithms [Xue, 2000], which improve the locality of references. The following
is a tiled version of the above algorithm:

for (kk=1; kk<= N; kk+=B)
  for (jj=1; jj<= N; jj+=B)
    for (i=1; i<= N; i++)
      for (k=kk; k<= min(kk+B-1,N); k++){
        r=X[i][k]; /* to be allocated to a register*/
        for (j=jj; j<= min(jj+B-1,N); j++)
          Z[i][j] += r* Y[k][j]
      }

Fig. 5.9 shows the corresponding access pattern. 




Figure 5.9. Access pattern for tiled/blocked matrix multiplication

The innermost loop is now restricted so that it accesses fewer array elements
(those shown in black). If a proper blocking factor is selected, the elements
are still in the cache when the next iteration of the innermost loop starts. The
blocking factor B can be chosen such that the elements of the innermost loops
fit into the cache. In particular, it can be chosen such that a B x B sub-matrix
of Y fits into the cache. This corresponds to a reuse factor of B for Y, since
the elements in the sub-matrix are accessed B times for each iteration of i.
Also, a block of B row elements of Z should fit into the cache. These will
then be reused during the iterations of k, resulting in a reuse factor of B for
Z as well. This reduces the overall number of memory references to at most
2N^3/B (for references to Z and Y) + N^2 (for references to X). In practice, the
reuse factor may be less than B. Optimizing the reuse factor has been an area of
comprehensive research. Initial research focused on the performance
improvements that can be obtained by tiling. Performance improvements for matrix
multiplication by a factor between 3 and 4.3 were reported by Lam [Lam et al.,
1991]. Possible improvements are expected to increase with the increasing
gap between processor and memory speeds. Tiling can also reduce the energy
consumption of memory systems [Chung et al., 2001].
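The tiled loop nest can be checked against the original one with a small C program fragment. The following sketch uses 0-based indices, as is usual in C, and therefore loop bounds of the form k < min(kk+B, N) instead of k <= min(kk+B-1, N). The sizes N = 6 and B = 4 are arbitrary assumptions; B is deliberately chosen so that it does not divide N, in order to exercise the min() bounds.

```c
#define N 6   /* matrix size (small, for illustration only)  */
#define B 4   /* blocking factor, deliberately not a divisor of N */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Original loop nest: Z += X * Y. */
void matmul_plain(double X[N][N], double Y[N][N], double Z[N][N])
{
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            double r = X[i][k];          /* candidate for a register */
            for (int j = 0; j < N; j++)
                Z[i][j] += r * Y[k][j];
        }
}

/* Tiled loop nest: the k and j loops only cover a B x B block of Y. */
void matmul_tiled(double X[N][N], double Y[N][N], double Z[N][N])
{
    for (int kk = 0; kk < N; kk += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = 0; i < N; i++)
                for (int k = kk; k < min_int(kk + B, N); k++) {
                    double r = X[i][k];  /* candidate for a register */
                    for (int j = jj; j < min_int(jj + B, N); j++)
                        Z[i][j] += r * Y[k][j];
                }
}
```

For each element Z[i][j], both versions add the products in the same order of k, so the results are identical; tiling only changes the order in which different elements are visited. Suitable values of B depend on the cache size of the target and would have to be determined experimentally.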

5.2.4 Loop splitting

Next, we discuss loop splitting as another optimization that can be applied 
before compiling the program. Potentially, this optimization could also be 
added to compilers. 

Many image processing algorithms perform some kind of filtering. This
filtering consists of considering the information about a certain pixel as well as
that of some of its neighbors. Corresponding computations are typically quite
regular. However, if the considered pixel is close to the boundary of the image,
not all neighboring pixels exist and the computations have to be modified. In a
straightforward description of the filtering algorithm, these modifications may
result in tests being performed in the innermost loop of the algorithm. A more
efficient version of the algorithm can be generated by splitting the loops such
that one loop body handles the regular cases and a second loop body handles
the exceptions. Figure 5.10 is a graphical representation of this transformation.

Figure 5.10. Splitting image processing into regular and special cases

Performing this loop splitting manually is a very difficult and error-prone
procedure. Falk et al. have published an algorithm [Falk and Marwedel, 2003] to
perform a procedure which also works for larger dimensions automatically. It
is based on a sophisticated analysis of accesses to array elements in loops.
Optimized solutions are generated using genetic algorithms. The following code
shows a loop nest from the MPEG-4 standard performing motion estimation:

for (z=0; z<20; z++)
 for (x=0; x<36; x++) {x1=4*x;
  for (y=0; y<49; y++) {y1=4*y;
   for (k=0; k<9; k++) {x2=x1+k-4;
    for (l=0; l<9; l++) {y2=y1+l-4;
     for (i=0; i<4; i++) {x3=x1+i; x4=x2+i;
      for (j=0; j<4; j++) {y3=y1+j; y4=y2+j;
       if (x3<0 || 35<x3 || y3<0 || 48<y3)
         then_block_1; else else_block_1;
       if (x4<0 || 35<x4 || y4<0 || 48<y4)
         then_block_2; else else_block_2;
}}}}}}

Using Falk’s algorithm, this loop nest is transformed into the following one:

for (z=0; z<20; z++)
 for (x=0; x<36; x++) {x1=4*x;
  for (y=0; y<49; y++)
   if (x>=10 || y>=14)
    for (; y<49; y++)
     for (k=0; k<9; k++)
      for (l=0; l<9; l++)
       for (i=0; i<4; i++)
        for (j=0; j<4; j++) {
         then_block_1; then_block_2}
   else {y1=4*y;
    for (k=0; k<9; k++) {x2=x1+k-4;
     for (l=0; l<9; l++) {y2=y1+l-4;
      for (i=0; i<4; i++) {x3=x1+i; x4=x2+i;
       for (j=0; j<4; j++) {y3=y1+j; y4=y2+j;
        if (0 || 35<x3 || 0 || 48<y3)
          then_block_1; else else_block_1;
        if (x4<0 || 35<x4 || y4<0 || 48<y4)
          then_block_2; else else_block_2;
}}}}}}

Instead of complicated tests in the innermost loop, we now have a splitting 
if-statement after the third for-loop statement. All regular cases are handled 
in the then-part of this statement. The else-part handles the relatively small 
number of remaining cases. 
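The principle of separating regular cases from special cases can be shown on a much smaller example than the MPEG-4 loop nest. The following sketch applies the idea by hand to a one-dimensional three-point filter. It does not implement the algorithm by Falk et al. (no access analysis and no genetic algorithm is involved); the array length LEN, the function names and the convention that missing neighbors count as 0 are assumptions.

```c
#define LEN 16   /* assumed length of the one-dimensional "image" */

/* Straightforward version: boundary tests are executed in every
   iteration. Neighbors outside the array are treated as 0. */
void filter_tests_inside(const int *in, int *out)
{
    for (int x = 0; x < LEN; x++) {
        int left  = (x > 0)       ? in[x - 1] : 0;
        int right = (x < LEN - 1) ? in[x + 1] : 0;
        out[x] = left + in[x] + right;
    }
}

/* Split version: the two boundary elements are handled separately,
   the loop for the regular cases is free of tests. */
void filter_split(const int *in, int *out)
{
    out[0] = in[0] + in[1];                       /* special case */
    for (int x = 1; x < LEN - 1; x++)             /* regular cases */
        out[x] = in[x - 1] + in[x] + in[x + 1];
    out[LEN - 1] = in[LEN - 2] + in[LEN - 1];     /* special case */
}
```

Both functions compute the same output. In the split version, the two conditional expressions per element disappear from the loop that handles LEN-2 of the LEN elements.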

Fig. 5.11 shows the number of cycles that can be saved by loop nest splitting 
for various applications and target processors. 

For the motion estimation algorithm, cycle counts can be reduced by up to 
about 75 % (to 25 % of the original value). Obviously, substantial savings are 
possible. This potential should certainly not be ignored. 

Figure 5.11. Results for loop splitting

5.2.5 Array folding 

Some embedded applications, especially in the multimedia domain, include 
large arrays. Since memory space in embedded systems is limited, options 
for reducing the storage requirements of arrays should be explored. Fig. 5.12 
represents the addresses used by five arrays as a function of time. At any par- 
ticular time only a subset of array elements is needed. The maximum number 
of elements needed is called the address reference window [De Greef et al., 
1997a]. In fig. 5.12, this maximum is indicated by a double-headed arrow.




Figure 5.12. Reference patterns for arrays

A classical memory allocation for arrays is shown in fig. 5.13 (left). Each array
is allocated the maximum of the space it requires during the entire execution 
time (if we consider global arrays). 





Figure 5.13. Unfolded (left) and inter-array folded (right) arrays 

One of the possible improvements, inter-array folding, is shown in fig. 5.13 
(right). Arrays which are not needed at overlapping time intervals can share 
the same memory space. A second improvement, intra-array folding [De Greef
et al., 1997b], is shown in fig. 5.14. It takes advantage of the limited sets of
components needed within an array. Storage can be saved at the expense of 
more complex address computations. 




Figure 5.14. Intra-array folded arrays 
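Intra-array folding can be illustrated with a simple case in which only the last few elements of an array are ever needed at the same time. The following is a minimal sketch, not the technique of De Greef et al.: the logical array length T, the window size W = 3, the producer function gen and the modulo-based address computation are assumptions chosen for illustration.

```c
#define T 12   /* logical length of the array                     */
#define W 3    /* maximum number of elements needed at any time   */

/* Hypothetical producer of array elements. */
static int gen(int t) { return 3 * t + 1; }

/* Unfolded version: storage for all T elements is allocated. */
long checksum_unfolded(void)
{
    int  x[T];
    long acc = 0;
    for (int t = 0; t < T; t++) {
        x[t] = gen(t);
        if (t >= W - 1)
            acc += x[t] + x[t - 1] + x[t - 2];
    }
    return acc;
}

/* Folded version: only W elements are stored; element t is mapped
   to location t % W, at the cost of a more complex address computation. */
long checksum_folded(void)
{
    int  x[W];
    long acc = 0;
    for (int t = 0; t < T; t++) {
        x[t % W] = gen(t);
        if (t >= W - 1)
            acc += x[t % W] + x[(t - 1) % W] + x[(t - 2) % W];
    }
    return acc;
}
```

Both functions return the same value, but the folded version needs storage for 3 instead of 12 elements. The modulo operations are the price paid in terms of address computations.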



The two kinds of foldings can also be combined.

Other forms of high-level transformations have been analyzed by Chung,
Benini and De Micheli [Chung et al., 2001], [Tan et al., 2003]. There are many
additional contributions in this domain in the compiler community.

In particular, function inlining replaces function calls by the code of the called
function. This transformation improves the speed of the code, but results in an
increase in the code size. Increased code sizes may be a problem in SoC
technologies. Traditional inlining techniques rely on the user identifying functions
to be inlined. This is a problem in systems on a chip, since the size of the
instruction memory is very critical for such systems. Hence, it is important to be
able to constrain the size of the instruction memory and to let design tools find
out automatically which of the functions should be inlined for a certain size
of the memory. Known approaches for this include techniques by Teich et al.
[Teich et al., 1999], Leupers et al. [Leupers and Marwedel, 1999], and Palkovic
et al. [Palkovic et al., 2002]. These techniques can be either integrated into a
compiler or can be applied as a source-to-source transformation before using
any compiler.

5.3 Hardware/software partitioning 
5.3.1 Introduction 

During the design process, we have to solve the problem of implementing the 
specification either in hardware or in the form of programs running on proces- 
sors. This section describes some of the techniques for this mapping. Applying 
these techniques, we will be able to decide which parts have to be implemented 
in hardware and which in software. 

By hardware/software partitioning we mean the mapping of task graph nodes 
to either hardware or software. A standard procedure for embedding hard- 
ware/software partitioning into the overall design flow is shown in fig. 5.15. 
We start from a common representation of the specification, e.g. in the form of
task graphs and information about the platform.

For each of the nodes of the task graphs, we need information concerning the
effort required and the benefits received from choosing a certain
implementation of these nodes. For example, execution times must be predicted
(see page 127). It is very hard to predict times required for communication.
Nevertheless, two tasks requiring a very high communication bandwidth should
preferably be mapped to the same components. Iterative approaches are used
in many cases. An initial solution to the partitioning problem is generated,
analyzed and then improved.



*The concept of inlining is assumed to be known to the reader from programming courses. 
Figure 5.15. General view of hardware/software partitioning

Some approaches for partitioning are restricted to mapping task graph nodes 
either to special purpose hardware or to software running on a single processor. 
Such partitioning can be performed with bipartitioning algorithms for graphs 
[Kuchcinski, 2002]. 

More elaborate partitioning algorithms are capable of mapping graph nodes
to multi-processor systems and hardware. In the following, we will describe
how this can be done using a standard optimization technique from operations
research, integer programming. Our presentation is based on a simplified
version of the optimization proposed for the codesign tool COOL [Niemann,
1998].

5.3.2 COOL 

For COOL, the input consists of three parts: 

■ Target technology: This part of the input to COOL comprises information
about the available hardware platform components. COOL supports
multiprocessor systems, but requires that all processors are of the same type,
since it does not include automatic or manual processor selection. The
name of the processor used (as well as information about the corresponding
compiler) must be included in this part of the input to COOL. As far
as the application-specific hardware is concerned, the information must be
sufficient for starting automatic hardware synthesis with all required
parameters. In particular, information about the technology library must be
given.

■ Design constraints: The second part of the input comprises design con- 
straints such as the required throughput, latency, maximum memory size, 
or maximum area for application-specific hardware. 
■ Behavior: The third part of the input describes the required overall
behavior. Hierarchical task graphs are used for this. For example, the
hierarchical task graph of fig. 2.46 could be used.

COOL uses two kinds of edges: communication edges and timing edges. 
Communication edges may contain information about the amount of infor- 
mation to be exchanged. Timing edges provide timing constraints. COOL 
requires the behavior of each of the leaf nodes^ of the graph hierarchy to be 
known. COOL expects this behavior to be specified in VHDL^. 

For partitioning, COOL uses the following steps: 

1 Translation of the behavior into an internal graph model. 

2 Translation of the behavior of each node from VHDL into C. 

3 Compilation of all C programs for the selected target processor, compu- 
tation of the resulting program size, estimation of the resulting execution 
time. If simulations are used for the latter, simulation input data must be 
available. 

4 Synthesis of hardware components: For each leaf node, application-specific
hardware is synthesized. Since quite a number of hardware components
may have to be synthesized, hardware synthesis should not be too
slow. It was found that commercial synthesis tools focusing on gate level
synthesis can be too slow to be useful for COOL. However, high-level
synthesis tools working at the register-transfer level (using adders, registers,
and multiplexers as components, rather than gates) provide sufficient
synthesis speed. Also, such tools can provide sufficiently precise values for
delay times and required silicon area. In the actual implementation, the
OSCAR high-level synthesis tool [Landwehr and Marwedel, 1997] is used.

5 Flattening the hierarchy: The next step is to extract a flat task graph 
from the hierarchical flow graph. Since no merging or splitting of nodes 
is performed, the granularity used by the designer is maintained. Cost 
and performance information gained from compilation and from hardware 
synthesis are added to the nodes. This is actually one of the key ideas of 
COOL: the information required for hardware/software partitioning is 
precomputed and it is computed with good precision. This information 
forms the basis for generating cost-minimized designs meeting the design 
constraints. 



^See page 20 for a definition of this term. 

^In retrospect, we now know that C should have been used for this, as this choice would have made the 
partitioning for many standards described in C easier. 
6 Generating and solving a mathematical model of the optimization
problem: COOL uses integer programming (IP) to solve the optimization
problem. A commercial IP solver is used to find values for decision variables
minimizing the cost. The solution is optimal with respect to the cost func- 
tion derived from the available information. However, this cost includes 
only a coarse approximation of the communication time. The communica- 
tion time between any two nodes of the task graph depends on the mapping 
of those nodes to processors and hardware. If both nodes are mapped to 
the same processor, communication will be local and thus quite fast. If the 
nodes are mapped to different hardware components, communication will 
be non-local and may be slower. Modeling communication costs for all pos- 
sible mappings of task graph nodes would make the model very complex 
and is therefore replaced by iterative improvements of the initial solution. 
More details on this step will be presented below. 

7 Iterative improvements: In order to work with good estimates of the com- 
munication time, adjacent nodes mapped to the same hardware component 
are now merged. This merging is shown in fig. 5.16. 
Figure 5.16. Merging of task nodes mapped to the same hardware component

We assume that tasks T1, T2 and T5 are mapped to hardware components
H1 and H2, whereas T3 and T4 are mapped to processor P1. Accordingly,
communication between T3 and T4 is local communication. Therefore, we
merge T3 and T4, and assume that the communication between the two
tasks does not require a communication channel. Communication time can
now be estimated with improved precision. The resulting graph is then used
as new input for mathematical optimization. The previous and the current
step are repeated until no more graph nodes are merged.

8 Interface synthesis: After partitioning, the glue logic required for inter- 
facing processors, application-specific hardware and memories is created. 
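The merging performed in step 7 can be sketched as follows. This is only an
illustration under stated assumptions: the edges of the task graph and the
exact assignment of T1, T2 and T5 to H1 and H2 are hypothetical (the figure is
not reproduced here), and the union-find structure is simply a convenient way
to group adjacent nodes that share a component; it is not taken from COOL.

```python
def merge_local_nodes(edges, mapping):
    """Merge adjacent task nodes that are mapped to the same component.

    edges:   list of (source, destination) pairs of the task graph
    mapping: dict from task name to component name
    Returns (sorted list of merged node groups, sorted list of remaining edges).
    """
    parent = {n: n for n in mapping}

    def find(n):
        # Follow parent links to the representative of n's group.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for u, v in edges:
        if mapping[u] == mapping[v]:      # communication is local
            parent[find(v)] = find(u)     # so both nodes are merged

    groups = {}
    for n in mapping:
        groups.setdefault(find(n), []).append(n)
    # Edges between different groups still need a communication channel.
    remaining = {(find(u), find(v)) for u, v in edges if find(u) != find(v)}
    return sorted(groups.values()), sorted(remaining)


# Hypothetical mapping and edges, chosen to mirror the situation in the text.
mapping = {"T1": "H1", "T2": "H2", "T3": "P1", "T4": "P1", "T5": "H1"}
edges = [("T1", "T3"), ("T2", "T3"), ("T3", "T4"), ("T4", "T5")]
groups, remaining = merge_local_nodes(edges, mapping)
print(groups)
print(remaining)
```

With these assumed inputs, only T3 and T4 are merged, and the edge between
them disappears from the list of edges that require a channel.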
Next, we will describe step 6 in more detail. IP models provide a general
approach for modeling optimization problems. IP models consist of two parts:
a cost function and a set of constraints. Both parts involve references to a set
$X = \{x_i\}$ of integer-valued variables. Cost functions must be linear
functions of those variables. So, they must be of the general form

$$C = \sum_{x_i \in X} a_i x_i, \quad \text{with } a_i \in \mathbb{R},\ x_i \in \mathbb{N}_0 \tag{5.1}$$

The set $J$ of constraints must also consist of linear functions of
integer-valued variables. They have to be of the form

$$\forall j \in J: \sum_{x_i \in X} b_{i,j}\, x_i \ge c_j, \quad \text{with } b_{i,j}, c_j \in \mathbb{R} \tag{5.2}$$

Note that $\ge$ can be replaced by $\le$ in equation (5.2) if constants
$b_{i,j}$ and $c_j$ are modified accordingly.

Def.: The integer programming (IP) problem is the problem of minimizing
cost function (5.1) subject to the constraints given in eq. (5.2). If all
variables are constrained to being either 0 or 1, the corresponding model is
called a 0/1-integer programming model. In this case, variables are also
denoted as (binary) decision variables.

For example, assuming that $x_1$, $x_2$ and $x_3$ cannot be negative and must
be integers, the following set of equations represents a 0/1-IP model:

$$C = 5x_1 + 6x_2 + 4x_3 \tag{5.3}$$

$$x_1 + x_2 + x_3 \ge 2 \tag{5.4}$$

$$x_1 \le 1 \tag{5.5}$$

$$x_2 \le 1 \tag{5.6}$$

$$x_3 \le 1 \tag{5.7}$$


Due to the constraints, all variables are either 0 or 1. There are four possible 
solutions. These are listed in table 5.1. The solution with a cost of 9 is optimal. 

Applications requiring maximizing some gain function C' can be changed into
the above form by setting C = -C'.
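For a model as small as the example above, the optimum can be checked by
simply enumerating all 0/1 assignments. The following sketch does this; the
function name solve_01_ip and the way constraints are encoded are assumptions
made for illustration. Exhaustive enumeration grows exponentially with the
number of variables and is not how real IP solvers work.

```python
from itertools import product


def solve_01_ip(costs, constraints):
    """Brute-force a 0/1-IP model.

    costs:       coefficients a_i of the cost function
    constraints: list of (coefficients, bound), each meaning sum(b_i * x_i) >= bound
    Returns (best (assignment, cost), list of all feasible (assignment, cost)).
    """
    best = None
    feasible = []
    for x in product((0, 1), repeat=len(costs)):
        satisfied = all(
            sum(b * xi for b, xi in zip(coeffs, x)) >= bound
            for coeffs, bound in constraints
        )
        if not satisfied:
            continue
        cost = sum(a * xi for a, xi in zip(costs, x))
        feasible.append((x, cost))
        if best is None or cost < best[1]:
            best = (x, cost)
    return best, feasible


# Cost function (5.3) and constraint (5.4); the bounds (5.5) to (5.7) are
# implied by enumerating only the values 0 and 1.
best, feasible = solve_01_ip([5, 6, 4], [([1, 1, 1], 2)])
print(feasible)
print(best)
```

The enumeration finds four feasible assignments and selects the one with
$x_1 = 1$, $x_2 = 0$, $x_3 = 1$ and a cost of 9.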

x1   x2   x3    C
 0    1    1   10
 1    0    1    9
 1    1    0   11
 1    1    1   15

Table 5.1. Possible solutions of the presented IP-problem

IP models can be solved optimally using mathematical programming
techniques. Unfortunately, integer programming is NP-complete and execution
times may become very large. Nevertheless, it is useful for solving
optimization problems as long as the model sizes are not extremely large.
Execution times depend on the number of variables and on the number and
structure of the constraints. Good IP solvers (like lp_solve [Berkelaar et
al., 2005] or CPLEX) can solve well-structured problems containing a few
thousand variables in acceptable computation times (e.g. minutes). For more
information on integer programming and related linear programming, refer to
books on the topic (e.g. to Wolsey [Wolsey, 1998]). Modeling optimization
problems as integer programming problems makes sense despite the complexity of
the problem: many problems can be solved in acceptable execution times and if
they cannot, IP models provide a good starting point for heuristics.

Next, we will describe how partitioning can be modeled using a 0/1-IP model.
The following index sets will be used in the description of the IP model:

■ Index set $I$ denotes task graph nodes. Each $i \in I$ corresponds to one
task graph node.

■ Index set $L$ denotes task graph node types. Each $l \in L$ corresponds to
one task graph node type. For example, there may be nodes describing square
root, Discrete Cosine Transform (DCT) or Discrete Fourier Transform
(DFT) computations. Each of them is counted as one type.

■ Index set $KH$ denotes hardware component types. Each $k \in KH$
corresponds to one hardware component type. For example, there may be special
hardware components for the DCT or the DFT. There is one index value for
the DCT hardware component and one for the DFT hardware component.

■ For each of the hardware components, there may be multiple copies, or
“instances”. Each instance is identified by an index $j \in J$.

■ Index set $KP$ denotes processors. Each $k \in KP$ identifies one of the
processors (all of which are of the same type).

The following decision variables are required by the model: 
■ $X_{i,k}$: this variable will be 1 if node $v_i$ is mapped to hardware
component type $k \in KH$, and 0 otherwise.

■ $Y_{i,k}$: this variable will be 1 if node $v_i$ is mapped to processor
$k \in KP$, and 0 otherwise.

■ $NY_{l,k}$: this variable will be 1 if at least one node of type $l$ is
mapped to processor $k \in KP$, and 0 otherwise.

■ $T$ is a mapping $I \to L$ from task graph nodes to their corresponding types.

In our particular case, the cost function accumulates the total cost of all
hardware units:

C = processor costs + memory costs + cost of application-specific hardware

We would obviously minimize the total cost if no processors, memory and
application-specific hardware were included in the “design”. Due to the
constraints, this is not a legal solution. We can now present a brief
description of some of the constraints of the IP model:

■ Operation assignment constraints: These constraints guarantee that each
operation is implemented either in hardware or in software. The
corresponding constraints can be formulated as follows:

$$\forall i \in I: \sum_{k \in KH} X_{i,k} + \sum_{k \in KP} Y_{i,k} = 1$$

In plain text, this means the following: for all task graph nodes $i$, the
following must hold: $i$ is implemented either in hardware (setting one of the
$X_{i,k}$ variables to 1, for some $k$) or it is implemented in software
(setting one of the $Y_{i,k}$ variables to 1, for some $k$).

All variables are assumed to be non-negative integer numbers:

$$X_{i,k} \in \mathbb{N}_0 \tag{5.8}$$

$$Y_{i,k} \in \mathbb{N}_0 \tag{5.9}$$

Additional constraints ensure that decision variables $X_{i,k}$ and $Y_{i,k}$
have 1 as an upper bound and, hence, are in fact 0/1-valued variables:

$$\forall i \in I: \forall k \in KH: X_{i,k} \le 1$$

$$\forall i \in I: \forall k \in KP: Y_{i,k} \le 1$$

If the functionality of a certain node of type $l$ is mapped to some
processor $k$, then this processor's instruction memory must include a copy of
the software for this function:

$$\forall l \in L,\ \forall i: T(v_i) = c_l,\ \forall k \in KP: NY_{l,k} \ge Y_{i,k}$$

In plain text, this means: for all types $l$ of task graph nodes and for all
nodes $i$ of this type, the following must hold: if $i$ is mapped to some
processor $k$ (indicated by $Y_{i,k}$ being 1), then the software corresponding
to functionality $l$ must be provided by processor $k$, and the corresponding
software must exist on that processor (indicated by $NY_{l,k}$ being 1).

Additional constraints ensure that decision variables $NY_{l,k}$ are also
0/1-valued variables:

$$\forall l \in L: \forall k \in KP: NY_{l,k} \le 1$$

■ Resource constraints: The next set of constraints ensures that “not too 
many” nodes are mapped to the same hardware component at the same 
time. We assume that, for every clock cycle, at most one operation can 
be performed per hardware component. Unfortunately, this means that the 
partitioning algorithm also has to generate a schedule for executing task 
graph nodes. Scheduling by itself is already an NP-complete problem for 
most of the relevant problem instances. 

■ Precedence constraints: These constraints ensure that the schedule for 
executing operations is consistent with the precedence constraints in the 
task graph. 

■ Design constraints: These constraints put a limit on the cost of certain 
hardware components, such as memories, processors or area of application- 
specific hardware. 

■ Timing constraints: Timing constraints, if present in the input to COOL, 
are converted into IP constraints. 

■ Some additional, but less important constraints are not included in this list. 








Figure 5.17. Task graph

Example: In the following, we will show how these constraints can be
generated for the task graph in fig. 5.17 (the same as the one in fig. 2.46).

Suppose that we have a hardware component library containing three component
types H1, H2 and H3 with costs of 20, 25 and 30 cost units, respectively.
Furthermore, suppose that we can also use a processor P of cost 5. In addition,
we assume that table 5.2 describes the execution times of our tasks on these
components.



T    H1   H2   H3     P
1    20    -    -   100
2     -   20    -   100
3     -    -   12    10
4     -    -   12    10
5    20    -    -   100

Table 5.2. Execution times of tasks T1 to T5 on components



Tasks T1 to T5 can only be executed on the processor or on one
application-specific hardware unit. Obviously, processors are assumed to be
cheap but slow in executing tasks T1, T2, and T5.

The following operation assignment constraints must be generated, assuming
that a maximum of one processor (P1) is to be used:

$X_{1,1} + Y_{1,1} = 1$ (Task 1 either mapped to H1 or to P1)

$X_{2,2} + Y_{2,1} = 1$ (Task 2 either mapped to H2 or to P1)

$X_{3,3} + Y_{3,1} = 1$ (Task 3 either mapped to H3 or to P1)

$X_{4,3} + Y_{4,1} = 1$ (Task 4 either mapped to H3 or to P1)

$X_{5,1} + Y_{5,1} = 1$ (Task 5 either mapped to H1 or to P1)
Furthermore, assume that the types of tasks T1 to T5 are $l$ = 1, 2, 3, 3 and 1,
respectively. Then, the following additional resource constraints are required:

$$NY_{1,1} \ge Y_{1,1} \tag{5.10}$$

$$NY_{2,1} \ge Y_{2,1}$$

$$NY_{3,1} \ge Y_{3,1}$$

$$NY_{3,1} \ge Y_{4,1}$$

$$NY_{1,1} \ge Y_{5,1} \tag{5.11}$$

Equation 5.10 means: if task 1 is mapped to the processor, then the function
$l = 1$ must be implemented on that processor. The same function must also be
implemented on the processor if task 5 is mapped to the processor (eq. 5.11).

We have not included timing constraints. However, it is obvious that the pro- 
cessor is slow in executing some of the tasks and that application-specific hard- 
ware is required for timing constraints below 100 time units. 

The cost function is: 

C = 20 * #(H1) + 25 * #(H2) + 30 * #(H3) + 5 * #(P)

where #() denotes the number of instances of hardware components. This
number can be computed from the variables introduced so far if the schedule is
also taken into account. For a timing constraint of 100 time units, the minimum
cost design comprises components H1, H2 and P. This means that tasks T3 and T4
are implemented in software and all others in hardware.
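The result for this small example can be cross-checked by enumerating all
mappings. The sketch below uses the costs and the execution times of table
5.2, but it makes a strong simplifying assumption that is not part of the
model above: all tasks are executed strictly one after the other, so that one
instance per component type suffices and the total delay is the sum of the
execution times. The real model derives the schedule from the task graph,
which is not reproduced here.

```python
from itertools import product

COST = {"H1": 20, "H2": 25, "H3": 30, "P": 5}

# Execution times from table 5.2: task -> {component: time}.
TIME = {
    1: {"H1": 20, "P": 100},
    2: {"H2": 20, "P": 100},
    3: {"H3": 12, "P": 10},
    4: {"H3": 12, "P": 10},
    5: {"H1": 20, "P": 100},
}


def cheapest_partition(deadline):
    """Return (cost, mapping) of the cheapest mapping meeting the deadline."""
    tasks = sorted(TIME)
    best = None
    # Each task is mapped to exactly one of its admissible components,
    # which mirrors the operation assignment constraints.
    for choice in product(*(sorted(TIME[t]) for t in tasks)):
        # Assumption: strictly sequential execution of all tasks.
        total_time = sum(TIME[t][c] for t, c in zip(tasks, choice))
        if total_time > deadline:
            continue
        # One instance of every component type that is actually used.
        cost = sum(COST[c] for c in set(choice))
        if best is None or cost < best[0]:
            best = (cost, dict(zip(tasks, choice)))
    return best


print(cheapest_partition(100))
```

Under this assumption, the cheapest design for a deadline of 100 time units
uses H1, H2 and P, with T3 and T4 running in software, which agrees with the
result stated above.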

In general, due to the complexity of the combined partitioning and scheduling 
problem, only small problem instances of the combined problem can be solved 
in acceptable run-times. Therefore, the problem is heuristically split into the 
scheduling and the partitioning problem: an initial partitioning is based on 
estimated execution times and the final scheduling is done after partitioning. If 
it turns out that the schedule was too optimistic, the whole process has to be 
repeated with tighter timing constraints. Experiments for small examples have 
shown that the cost for heuristic solutions is only 1 or 2 % larger than the cost 
of optimal results. 

Automatic partitioning can be used for analyzing the design space. In the fol- 
lowing, we will present results for an audio lab, including mixer, fader, echo, 
equalizer and balance units. This example uses earlier target technologies in 
order to demonstrate the effect of partitioning. The target hardware consists 
of a (slow) SPARC processor, external memory, and application-specific
hardware to be designed from an (outdated) 1µ ASIC library. The total allowable
delay is set to 22675 ns, corresponding to a sample rate of 44.1 kHz, as used
in CDs. Fig. 5.18 shows different design points which can be generated by
changing the delay constraint.




(Horizontal axis: Time [ns]; labelled values 3060, 12470, 18600, 28510, 41420,
48080, 60410 and 72900)

Figure 5.18. Design space for audio lab

The unit λ refers to a technology-dependent length unit. It is essentially one
half of the closest distance between the centers of two metal wires on the chip 
(also called half-pitch [SEMATECH, 2003]). The design point at the left cor- 
responds to a solution implemented completely in hardware, the design point 
at the right to a software solution. Other design points use a mixture of hard- 
ware and software. The one corresponding to an area of 78.4 is the cheapest 
meeting the deadline. 

Obviously, technology has advanced to allow a 100% software-based audio 
lab design nowadays. Nevertheless, this example demonstrates the underlying 
design methodology which can also be used for more demanding applications, 
especially in the high-speed multimedia domain, such as MPEG-4. 

5.4 Compilers for embedded systems 
5.4.1 Introduction 

Obviously, optimizations and compilers are available for the processors used 
in PCs and compiler generation for commonly used 32-bit processors is well 
understood. For embedded systems, standard compilers are also used in many
cases, since they are typically cheap or even freely available. 

However, there are several reasons for designing special optimizations and 
compilers for embedded systems: 

■ Processor architectures in embedded systems exhibit special features (see 
page 100). These features should be exploited by compilers in order to 
generate efficient code. 

■ High levels of optimization are more important than high compilation speed. 

■ Compilers could potentially help to meet and prove real-time constraints.
For example, it may be beneficial to freeze certain cache lines in order to
prevent frequently executed code from being evicted and reloaded several
times.

■ Compilers may help to reduce the energy consumption of embedded
systems. Compilers performing energy optimizations should be available.

■ For embedded systems, there is a larger variety of instruction sets. Hence,
there are more processors for which compilers should be available.
Sometimes there is even the request to support the optimization of instruction
sets with retargetable compilers. For such compilers, the instruction set can
be specified as an input to a compiler generation system. Such systems can
be used for experimentally modifying instruction sets and then observing
the resulting changes for the generated machine code. This is one
particular case of design space exploration and is supported, for example, by
Tensilica tools [Tensilica Inc., 2003].

Some first approaches for retargetable compilers are described in the first book
on this topic [Marwedel and Goossens, 1995]. Optimizations can be found
in more recent books by Leupers [Leupers, 1997], [Leupers, 2000a]. In this
section, we will present examples of compilation techniques for embedded
processors.

Compilation techniques might also have to support compression techniques
described on pages 103 to 105.

5.4.2 Energy-aware compilation 

Many embedded systems are mobile systems which have to run on batteries.
While computational demands on mobile systems are increasing, battery
technology is expected to improve only slowly [SEMATECH, 2003]. Hence, the
availability of energy is a serious bottleneck for new applications.
Saving energy can be done at various levels, including the fabrication process
technology, the device technology, circuit design, the operating system and
the application algorithms. Adequate translation from algorithms to machine
code can also help. High-level optimization techniques such as those presented
on pages 157 to 167 can also help to reduce the energy consumption. In this
section, we will look at compiler optimizations which can reduce the energy
consumption. Power models are essential ingredients of all power
optimizations. A general problem of power models is their frequently very
limited precision^.

■ One of the first power models was proposed by Tiwari [Tiwari et al., 1994]. 
The model includes so-called base costs and inter-instruction costs. Base 
costs of an instruction correspond to the energy consumed per instruction 
execution if an infinite sequence of that instruction is executed. Inter- 
instruction costs model the additional energy consumed by the processor 
if instructions change. This additional energy is required, for example, due 
to switching functional units on and off. This power model focuses on the 
consumption in the processor and does not consider the power consumed 
in the memory or in other parts of the system. 

■ Another power model was proposed by Simunic et al. [Simunic et al., 
1999]. That model is based on data sheets. The advantage of this approach 
is that the contribution of all components of an embedded system to the 
energy consumption can be computed. However, the information in data 
sheets about average values may be less precise than the information about 
maximal or minimal values. 

■ A third model has been proposed by Russell and Jacome [Russell and
Jacome, 1998]. This model is based on precise measurements of two fixed
configurations.

■ Still another model was proposed by Lee [Lee et al., 2001]. This model
includes a detailed analysis of the effects of the pipeline. It does not include
multicycle operations and pipeline stalls.

■ The encc energy-aware compiler from Dortmund University uses the
energy model by Steinke et al. [Steinke et al., 2001]. It is based on precise
measurements using real hardware. The consumption of the processor as
well as that of the memory is included.

■ The energy consumption of caches can be computed with CACTI [Wilton
and Jouppi, 1996].
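The structure of the first model in this list, base costs plus
inter-instruction costs, can be illustrated with a few lines of Python. All
instruction names and energy values below are hypothetical and carry no
physical meaning, and treating the inter-instruction cost as symmetric in the
two instructions is a simplification made for this sketch.

```python
def sequence_energy(instructions, base_cost, inter_cost):
    """Energy of an instruction sequence: base costs plus inter-instruction costs.

    base_cost:  dict instruction -> energy per execution
    inter_cost: dict frozenset({a, b}) -> extra energy when a and b are adjacent
    """
    energy = sum(base_cost[op] for op in instructions)
    for prev, cur in zip(instructions, instructions[1:]):
        if prev != cur:  # extra cost only if the instruction changes
            energy += inter_cost.get(frozenset((prev, cur)), 0)
    return energy


# Hypothetical cost tables in arbitrary energy units.
base = {"load": 3, "add": 1, "mul": 2}
inter = {
    frozenset(("load", "add")): 1,
    frozenset(("add", "mul")): 1,
    frozenset(("load", "mul")): 2,
}
print(sequence_energy(["load", "add", "add", "mul"], base, inter))
```

An energy-aware scheduler could evaluate such a function for different legal
instruction orders and keep the cheapest one.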



^Deviations of about 50% are frequently mentioned in discussions. 
Using models like the ones above, the following compiler optimizations have
been used for reducing the energy consumption:

■ Energy-aware scheduling: the order of instructions can be changed as 
long as the meaning of the program does not change. The order can be 
changed such that the number of transitions on the instruction bus is min- 
imized. This optimization can be performed on the output generated by a 
compiler and therefore does not require any change to the compiler. 

■ Energy-aware instruction selection: typically, there are different instruc- 
tion sequences for implementing the same source code. In a standard com- 
piler, the number of instructions or the number of cycles is used as a cri- 
terion (cost function) for selecting a good sequence. This criterion can be 
replaced by the energy consumed by that sequence. Steinke and others 
found that low-power instruction selection reduces the energy consumption 
by some percent. 

■ Replacing the cost function is also possible for other standard compiler 
optimizations, such as register pipelining, loop invariant code motion etc. 
Possible improvements are also in the order of a few percent. 

■ Exploitation of the memory hierarchy: As explained on page 118, smaller 
memories provide faster access and consume less energy per access. There- 
fore, a significant amount of energy can be saved if the existence of small 
scratch pad memories (SPMs) can be exploited by a compiler. For this 
purpose, each basic block and each variable can be modeled as a memory
segment $i$. For each segment, there is a corresponding size $s_i$. Using
profiling, it is possible to compute the gain $g_i$ of moving segment $i$ to
the scratch pad memory. Let

$$x_i = \begin{cases} 1 & \text{if segment } i \text{ is mapped to the SPM} \\ 0 & \text{otherwise} \end{cases} \tag{5.12}$$

Then, the goal is to maximize

$$\sum_i g_i \cdot x_i \tag{5.13}$$

while respecting the size constraint

$$\sum_i s_i \cdot x_i \le K \tag{5.14}$$

where $K$ is the size of the SPM.

This problem is known as a knapsack problem. The solution of this problem 
is a one-to-one mapping. An integer programming model leading to such a 
mapping was presented by Steinke et al. [Steinke et al., 2002b]. For some
benchmark applications, energy reductions of up to about 80% were found, 
even though the size of the SPM was just a small fraction of the total code 
size of the application. Results for the bubble sort program are shown in 
fig. 5.19. 




Figure 5.19. Energy reduction by compiler-based mapping to scratch-pad for bubble sort

Obviously, larger SPMs lead to a reduced energy consumption in the main
memory. The energy required in the processor is also reduced, since fewer
wait cycles are required. Supply voltages have been assumed to be constant.

Code can also be dynamically copied into the SPM, resulting in a many-to- 
one mapping. An integer programming model reflecting this more general 
optimization problem was also proposed by Steinke et al. [Steinke et al., 
2002a]. Using this more general model, the energy gain can be increased, 
especially for applications for which the SPM is too small to contain all hot 
spots. 

Of all the compiler optimizations analyzed by Steinke, the energy savings 
enabled by memory hierarchies are the largest. 
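The static scratch pad allocation described by equations (5.12) to (5.14) is a
0/1 knapsack problem. As a minimal sketch, the following function solves it by
dynamic programming instead of the integer programming model used in the cited
work; the segment sizes, the gains and the SPM size in the example are
hypothetical, and sizes are assumed to be non-negative integers.

```python
def allocate_spm(sizes, gains, capacity):
    """Choose segments for the SPM: maximize sum(g_i * x_i), sum(s_i * x_i) <= K.

    Returns (x, total_gain), where x[i] is 1 if segment i goes to the SPM.
    """
    n = len(sizes)
    # best[i][k]: maximum gain using the first i segments and capacity k.
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s, g = sizes[i - 1], gains[i - 1]
        for k in range(capacity + 1):
            best[i][k] = best[i - 1][k]                # segment i not in SPM
            if s <= k and best[i - 1][k - s] + g > best[i][k]:
                best[i][k] = best[i - 1][k - s] + g    # segment i in SPM
    # Backtrack to recover the decision variables x_i.
    x = [0] * n
    k = capacity
    for i in range(n, 0, -1):
        if best[i][k] != best[i - 1][k]:
            x[i - 1] = 1
            k -= sizes[i - 1]
    return x, best[n][capacity]


# Hypothetical segments: sizes in bytes, gains in arbitrary energy units.
x, gain = allocate_spm(sizes=[4, 3, 2, 5], gains=[10, 7, 6, 9], capacity=7)
print(x, gain)
```

The table has (n + 1) * (K + 1) entries, so this approach is only practical
when the SPM size, measured in allocation units, is moderate.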

5.4.3 Compilation for digital signal processors 

Features of DSP processors are described on page 108. Compilers should ex- 
ploit these. Techniques for this can be demonstrated using address generation 
units as examples. This possibility of generating addresses “for free” has an
important impact on how variables should be laid out in memory. Fig. 5.20 
shows an example. 



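(Left part of the figure: variables a, b, c and d are stored at addresses 0, 1,
2 and 3, and the address register A is updated as listed below. The right part
of the figure is not reproduced here.)

LOAD A,1   ; b
A += 2     ; d
A -= 3     ; a
A += 2     ; c
A++        ; d
A--        ; c

Figure 5.20. Comparison of memory layouts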

We assume that in some basic block, variables a to d are accessed in the se- 
quence (b,d,a,c,d,c). Accessing these variables with register-indirect address- 
ing requires, first of all, loading the address of b into an address register (see 
fig. 5.20, left). The instruction referring to variable b is not shown in fig. 5.20, 
since the current focus is on address generation. Therefore, the generation of 
the address for the access to the next variable (d) is considered next. Assuming 
that there is just a single address register A, A has to be updated to point to vari- 
able d. This requires adding 2 to the register. Again, we ignore the instruction 
loading the variable, and we immediately consider the access to a. For this, we 
have to subtract 3, and for the next access we have to add 2. Assuming that 
the auto-increment and -decrement range is restricted to ± 1, only the last two 
accesses shown in fig. 5.20 can be implemented with these operations. In total, 
4 instructions for calculating addresses are needed. 

In contrast, for the layout in fig. 5.20 (right), 4 address calculations are auto- 
increment and -decrement operations which will be executed in parallel with 
some operation in the main data path. Only 2 cycles are needed for address 
calculations with an offset larger than 1. Again, the instructions actually using 
the variables are not shown. 
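The instruction counts given for the two layouts can be reproduced with a
short sketch. The function below counts the initial load of the address
register plus every address update whose offset exceeds the auto-increment and
-decrement range. The second layout (b, d, c, a) is an assumption, chosen
because it is consistent with the counts stated in the text; the layout
actually drawn in the figure may differ.

```python
def address_instructions(layout, accesses, auto_range=1):
    """Count explicit address instructions for one address register.

    layout:   variables in the order in which they are stored in memory
    accesses: sequence of variable accesses within the basic block
    """
    addr = {var: i for i, var in enumerate(layout)}
    explicit = 1  # initial load of the address register
    for prev, cur in zip(accesses, accesses[1:]):
        # Offsets within the auto range are executed "for free".
        if abs(addr[cur] - addr[prev]) > auto_range:
            explicit += 1
    return explicit


accesses = list("bdacdc")
print(address_instructions(list("abcd"), accesses))  # layout of fig. 5.20 (left)
print(address_instructions(list("bdca"), accesses))  # assumed improved layout
```

For the alphabetical layout this yields 4 explicit instructions, and for the
assumed improved layout it yields 2.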

How do we generate such clever memory layouts? Algorithms doing this typ- 
ically start from an access graph (see fig. 5.21). 




(Panels: access graph, linear path, memory layout)

Figure 5.21. Memory allocation for access sequence (b, d, a, c, d, c) for a
single address register A
Such access graphs have one node for each of the variables and have an edge
for every pair of variables for which there are adjacent accesses. The weight 
of such edges corresponds to the number of adjacent accesses to the variables 
connected by that edge. 

Variables connected by an edge of a high weight should preferably be allocated 
to adjacent memory locations. The number of address calculations saved in this 
way is equal to the weight of the corresponding edge. For example, if c and d 
are allocated to adjacent locations, then the last two accesses in the sequence 
can be implemented with auto-increment and -decrement operations. 

The overall goal of memory allocation is to find a linear order of variables
in memory maximizing the use of auto-increment and -decrement operations.
This corresponds to finding a linear path of maximum weight in the variable
access graph. Unfortunately, the maximum weighted path problem in graphs
is NP-complete. Hence, it is common to use heuristics for generating such
paths [Liao et al., 1995b], [Sudarsanam et al., 1997]. Most of them are based
on Kruskal's spanning tree heuristic. They start with a graph with no edges
and then incrementally add edges with decreasing weight, always keeping the
degree of all nodes to at most 2 and avoiding cycles. The order of the variables
in memory will then correspond to the order of the variables along the linear
path.
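
The heuristic just described can be sketched as follows. This is a minimal
illustration in Python, not the algorithm of the cited papers: the function
names and the tie-breaking rule (alphabetical order among edges of equal
weight) are assumptions made for this sketch.

```python
from collections import Counter

def access_graph(sequence):
    """Count adjacent accesses for every unordered pair of distinct variables."""
    weights = Counter()
    for u, v in zip(sequence, sequence[1:]):
        if u != v:
            weights[frozenset((u, v))] += 1
    return weights

def greedy_layout(sequence):
    """Add edges by decreasing weight, keeping degree <= 2 and avoiding cycles."""
    variables = sorted(set(sequence))
    weights = access_graph(sequence)
    parent = {v: v for v in variables}      # union-find for cycle detection

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    neighbors = {v: [] for v in variables}
    # Heaviest edges first; equal weights are ordered by variable name.
    edges = sorted(weights.items(), key=lambda e: (-e[1], sorted(e[0])))
    for pair, _ in edges:
        u, v = sorted(pair)
        if len(neighbors[u]) < 2 and len(neighbors[v]) < 2 and find(u) != find(v):
            neighbors[u].append(v)
            neighbors[v].append(u)
            parent[find(u)] = find(v)

    # Walk every path fragment from one of its end points and concatenate.
    layout, seen = [], set()
    for start in variables:
        if start in seen or len(neighbors[start]) == 2:
            continue
        prev, cur = None, start
        while cur is not None:
            layout.append(cur)
            seen.add(cur)
            nxt = [n for n in neighbors[cur] if n != prev]
            prev, cur = cur, (nxt[0] if nxt else None)
    return layout

def address_instructions(sequence, layout):
    """Initial address load plus every step whose offset is larger than 1."""
    pos = {v: i for i, v in enumerate(layout)}
    return 1 + sum(abs(pos[b] - pos[a]) > 1 for a, b in zip(sequence, sequence[1:]))

seq = list("bdacdc")
print(greedy_layout(seq), address_instructions(seq, greedy_layout(seq)))
```

For the access sequence (b,d,a,c,d,c), the sketch places c and d next to each
other first, because their edge has weight 2. The resulting layout needs 2
explicit address calculations, compared to 4 for the alphabetical layout
a, b, c, d.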

The algorithm just sketched only covers a simple case. Extensions of this 
algorithm cover more complex situations, such as: 

■ n > 1 address registers [Leupers and Marwedel, 1996],

■ also using modify registers present in the AGU [Leupers and Marwedel,
1996], [Leupers and David, 1998],

■ extension to arrays [Basu et al., 1999],

■ larger auto-increment and -decrement ranges [Sudarsanam et al., 1997].

Memory allocation, as described above, improves both the code-size and the 
run-time of the generated code. Other proposed optimization algorithms ex- 
ploit further architectural features of DSP processors, such as: 

■ multiple memory banks [Sudarsanam and Malik, 1995],

■ heterogeneous register files [Araujo and Malik, 1995],

■ modulo addressing,

■ instruction level parallelism [Leupers and Marwedel, 1995],

■ multiple operation modes [Liao et al., 1995a].

Other, new optimization techniques are described by Leupers [Leupers, 2000a]. 

5.4.4 Compilation for multimedia processors 

In order to fully support packed data types as described on page 110, com- 
pilers must be able to automatically convert operations in loops to operations 
on packed data types. Taking advantage of this potential is necessary for gen- 
erating efficient software. A very challenging task is to use this feature in 
compilers. Compiler algorithms exploiting operations on packed data types 
are extensions of vectorizing algorithms originally developed for supercom- 
puters, but only some algorithms have been described so far [Fisher and Dietz, 
1998], [Fisher and Dietz, 1999], [Leupers, 2000b], [Krall, 2000], [Larsen and 
Amarasinghe, 2000]. 
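
To illustrate what such a conversion produces, the following sketch contrasts
a scalar loop with a version that processes four 8-bit values per operation.
It is only an illustration in Python: real compilers emit the packed
instructions of the target processor, whereas this sketch emulates a packed
addition with ordinary integer operations. All names are hypothetical, and the
input length is assumed to be a multiple of four.

```python
LANES, MASK32 = 4, 0xFFFFFFFF

def pack(values):
    """Pack four 8-bit values into one 32-bit word (lane 0 in the low byte)."""
    word = 0
    for lane, v in enumerate(values):
        word |= (v & 0xFF) << (8 * lane)
    return word

def unpack(word):
    """Split a 32-bit word into its four 8-bit lanes."""
    return [(word >> (8 * lane)) & 0xFF for lane in range(LANES)]

def packed_add8(x, y):
    """Add four 8-bit lanes at once; carries never cross a lane boundary."""
    low = (x & 0x7F7F7F7F) + (y & 0x7F7F7F7F)        # add low 7 bits per lane
    return (low ^ ((x ^ y) & 0x80808080)) & MASK32   # fix up top bit per lane

def add_scalar(a, b):
    """Original loop: one 8-bit addition (modulo 256) per iteration."""
    return [(p + q) & 0xFF for p, q in zip(a, b)]

def add_vectorized(a, b):
    """Converted loop: one packed addition per four elements."""
    out = []
    for k in range(0, len(a), LANES):
        out += unpack(packed_add8(pack(a[k:k + LANES]), pack(b[k:k + LANES])))
    return out

a = [1, 2, 3, 250, 100, 200, 255, 0]
b = [1, 1, 1, 10, 100, 100, 1, 0]
print(add_vectorized(a, b) == add_scalar(a, b))
```

The vectorized loop runs a quarter of the iterations of the scalar loop. On a
processor with packed data types, each iteration would map to a single packed
add instruction.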

Automatic parallelization of loops for the M3-DSP (see page 113) requires the
use of vectorization techniques, which achieve significant speedups (compared
to the case of sequential operations, see fig. 5.22) [Lorenz et al., 2002]. For
application dot_product_2, the size of the vectors was too small to lead to a
speedup and no vectorization should be performed. The number of cycles can
be reduced by 94 % for benchmark example if vectorization is combined with
an exploitation of zero-overhead-loop instructions.



[Figure 5.22 plot; vertical axis: rel. number of cycles [%]]

Figure 5.22. Reduction of the cycle count by vectorization for the M3-DSP



5.4.5 Compilation for VLIW processors 

VLIW architectures (see page 111) require special compiler optimizations:

■ A key optimization required for TMS 320C6xx compilers is to allocate, at
compile time, the functional unit that should execute a certain operation.
Due to the two data paths (see fig. 3.21), this implies a partitioning of the
operations into two sets [Jacome and de Veciana, 1999], [Jacome et al.,
2000], [Leupers, 2000c] and also includes an allocation to one of the
register files.

■ VLIW processors frequently have branch delay slots. For VLIW proces- 
sors, the branch delay penalty is significantly larger than for other proces- 
sors, because each of the branch delay slots could hold a full instruction 
packet, not just a single instruction. For example, for the TMS 320C6xx, 
the branch delay penalty is 5 x 8 = 40 instructions. In order to avoid this 
large penalty, most VLIW processors support predicated execution for a 
large number of condition code registers. Predicated execution can be em- 
ployed to efficiently implement small if-statements. For large if-statements, 
however, conditional branches are more efficient, since these allow mutual 
exclusion of then- and else-branches to be exploited in hardware allocation.
The precise tradeoff between the two methods for implementing if-statements
can be found with proper optimization techniques [Mahlke et al., 1992],
[August et al., 1997], [Leupers, 1999].

■ Due to the large branch delay penalty, inlining (see page 167) is another 
optimization that is very useful for VLIW processors. 

5.4.6 Compilation for network processors 

Network processors are a new type of processors. They are optimized for 
high-speed Internet applications. Their instruction sets comprise numerous 
instructions for accessing and processing bit fields in streams of information.
Typically, they are programmed in assembly languages, since their throughput
is of utmost importance. Nevertheless, network protocols are becoming
more and more complex and designing compilers for such processors supports
the design of network components. The necessary bit-level details have been
analyzed by Wagner et al. [Wagner and Leupers, 2002]. Wagner obtained a
28% performance gain by exploiting special bit-level instructions of a network
processor.

5.4.7 Compiler generation, retargetable compilers 
and design space exploration 

When the first compilers were designed, compiler design was a totally manual
process. In the meantime, some of the steps involved in generating a compiler
have been automated or supported by tools. For example, lex and yacc and
more recent versions of these tools (see http://www.combo.org/lex_yacc_page)
provide a standard means for parsing the source code. Generating machine
instructions is another step which is now supported by tools. For example,
tree pattern matchers such as olive [Sudarsanam, 1997] can be used for this
task. Despite the use of such tools, compiler design is typically not a fully
automated process.

However, there have been many attempts to design retargetable compilers. We 
distinguish between different kinds of retargetability: 

■ Developer retargetability: In this case, compiler specialists are responsi- 
ble for retargeting compilers to new instruction sets. 

■ User retargetability: In this case, users are responsible for retargeting the 
compiler. This approach is much more challenging. 

More information about retargetable compilers and their use for design space 
exploration can be found in a book by Leupers and Marwedel [Leupers and 
Marwedel, 2001]. 

5.5 Voltage Scaling and Power Management 
5.5.1 Dynamic Voltage Scaling 

Some embedded processors support dynamic voltage scheduling and dynamic 
power management (see page 102). An additional optimization step can be 
used to exploit these features. Typically, such an optimization step follows 
code generation by the compiler. These optimizations require a global view of 
all tasks of the system, including their dependencies, slack times etc. 

The potential of dynamic voltage scheduling is demonstrated by the following 
example [Ishihara and Yasuura, 1998]. We assume that we have a processor 
which runs at three different voltages, 2.5 V, 4.0 V, and 5.0 V. Assuming an 
energy consumption of 40 nJ per cycle at 5.0 V, equation 3.1 can be used to 
compute the energy consumption at the other voltages (see table 5.3, where 25 
nJ is a rounded value). 



Vdd [V]                  5.0    4.0    2.5
Energy per cycle [nJ]     40     25     10
f_max [MHz]               50     40     25
cycle time [ns]           20     25     40

Table 5.3. Characteristics of processor with DVS

A minimum energy consumption is achieved for the ideal supply voltage of
4 Volts. In the following, we use the term variable voltage processor only
for processors that allow any supply voltage up to a certain maximum. It is
expensive to support truly variable voltages, and therefore, actual processors
support only a few fixed voltages.
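
The entries of table 5.3 and the benefit of running at the ideal voltage can be
reproduced with a short calculation. The sketch below assumes that the energy
per cycle scales with the square of the supply voltage, as equation 3.1
suggests. The workload of 10^9 cycles and the deadline of 25 s are assumed
values, chosen here only because they match the maximum frequency at 4.0 V;
the function names are hypothetical.

```python
E_REF_NJ, V_REF = 40.0, 5.0                  # energy per cycle at 5.0 V (from the text)
F_MAX_MHZ = {5.0: 50, 4.0: 40, 2.5: 25}      # maximum frequencies from table 5.3

def energy_per_cycle_nj(v):
    """Energy per cycle, assuming it scales with the square of the voltage."""
    return E_REF_NJ * (v / V_REF) ** 2

def schedule_cost(schedule):
    """schedule: list of (voltage, cycles). Returns (time in s, energy in J)."""
    time_s = sum(c / (F_MAX_MHZ[v] * 1e6) for v, c in schedule)
    energy_j = sum(c * energy_per_cycle_nj(v) * 1e-9 for v, c in schedule)
    return time_s, energy_j

CYCLES = 1_000_000_000                       # assumed workload
candidates = {
    "5.0 V only": [(5.0, CYCLES)],
    "5.0 V, then 2.5 V": [(5.0, 750_000_000), (2.5, 250_000_000)],
    "4.0 V only": [(4.0, CYCLES)],
}
for name, schedule in candidates.items():
    t, e = schedule_cost(schedule)
    print(f"{name}: {t:.1f} s, {e:.1f} J")
```

Under these assumptions, running at 5.0 V finishes after 20 s and consumes
40 J. Using the remaining slack by switching to 2.5 V for the last quarter of
the cycles meets the 25 s deadline with 32.5 J. Running all cycles at 4.0 V
also finishes exactly at the deadline and needs only about 25.6 J (25 J with
the rounded table value).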

The observations made for the above example can be generalized into the fol- 
lowing statements. The proofs of these statements are given in the paper by 
Ishihara and Yasuura. 

■ If a variable voltage processor completes a task before the deadline, the
energy consumption can be reduced*.

■ If a processor uses a single supply voltage v and completes a task T just 
at its deadline, then v is the unique supply voltage which minimizes the 
energy consumption of T. 

■ If a processor can only use a number of discrete voltage levels, then a volt- 
age schedule with at most two voltages minimizes the energy consumption 
under any time constraint. 

■ If a processor can only use a number of discrete voltage levels, then the
two voltages which minimize the energy consumption are the two immediate
neighbors of the ideal voltage V_ideal possible for a variable voltage
processor.

The statements can be used for allocating voltages to tasks. Next, we will 
consider the allocation of voltages to a set of tasks. We will use the following 
notation: 

N : the number of tasks

EC_j : the number of executed cycles of task j

L : the number of voltages of the target processor

V_i : the i-th voltage, with 1 ≤ i ≤ L

F_i : the clock frequency for supply voltage V_i

T : the global deadline at which all tasks must have been completed

X_{i,j} : the number of clock cycles task j is executed at voltage V_i

SC_j : the average switching capacitance during the execution of task j (SC_j comprises
the actual capacitance C_L and the switching activity α (see eq. 3.1 on page 102))

The voltage scaling problem can then be formulated as an integer programming
(IP) problem (see page 171). Simplifying assumptions of the IP-model include
the following:

■ There is one target processor that can be operated at a limited number of
discrete voltages.

■ The time for voltage and frequency switches is negligible.

■ The worst case number of cycles for each task is known.

* This formulation makes an implicit assumption in lemma 1 of the paper by Ishihara and Yasuura explicit.

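Using these assumptions, the IP-problem can be formulated as follows:
Minimize

\[ E = \sum_{j=1}^{N} \sum_{i=1}^{L} SC_j \cdot X_{i,j} \cdot V_i^2 \tag{5.18} \]

subject to

\[ \forall j: \sum_{i=1}^{L} X_{i,j} = EC_j \tag{5.19} \]

and

\[ \sum_{j=1}^{N} \sum_{i=1}^{L} \frac{X_{i,j}}{F_i} \le T \tag{5.20} \]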



The goal is to find the number X_{i,j} of cycles that each task j is executed
at a certain voltage V_i. According to the statements made above, no task will
ever need more than two voltages. Using this model, Ishihara and Yasuura show
that efficiency is typically improved if tasks have a larger number of voltages
to choose from. If large amounts of slack time are available, many voltage
levels help to find close to optimal voltage levels. However, four voltage
levels already give good results quite frequently.
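
The IP model can be made concrete with a small sketch that evaluates the
energy objective and the two constraints (one per task on the number of
cycles, one on the global deadline) and then searches a coarse grid of cycle
assignments exhaustively. This is not the solution method of Ishihara and
Yasuura, and a real design flow would use an IP solver. The instance (two
tasks, the three voltages of table 5.3, cycles counted in units of 10^6,
frequencies in MHz) and all names are assumptions made for this illustration.

```python
from itertools import product

def energy(X, SC, V):
    """Objective: sum of SC_j * X[i][j] * V_i^2 over all voltages i and tasks j."""
    return sum(SC[j] * X[i][j] * V[i] ** 2
               for i in range(len(V)) for j in range(len(SC)))

def feasible(X, EC, F, T):
    """Every task executes exactly EC_j cycles and all tasks finish by T."""
    levels, tasks = range(len(F)), range(len(EC))
    cycles_ok = all(sum(X[i][j] for i in levels) == EC[j] for j in tasks)
    time_ok = sum(X[i][j] / F[i] for i in levels for j in tasks) <= T
    return cycles_ok and time_ok

def splits(total, parts, step):
    """All ways to divide `total` cycles into `parts` multiples of `step`."""
    units = total // step
    for head in product(range(units + 1), repeat=parts - 1):
        rest = units - sum(head)
        if rest >= 0:
            yield tuple(u * step for u in head) + (rest * step,)

def brute_force(EC, SC, V, F, T, step):
    """Return (energy, X) of the cheapest feasible grid point, or None."""
    best = None
    for per_task in product(*(splits(ec, len(V), step) for ec in EC)):
        # per_task[j][i] is the number of cycles of task j at voltage i.
        X = [[per_task[j][i] for j in range(len(EC))] for i in range(len(V))]
        if feasible(X, EC, F, T):
            e = energy(X, SC, V)
            if best is None or e < best[0]:
                best = (e, X)
    return best

# Assumed instance: cycles in units of 10^6, frequencies in MHz, time in s.
V = [5.0, 4.0, 2.5]
F = [50.0, 40.0, 25.0]
EC = [600, 400]
SC = [1.6, 1.6]          # chosen so that SC * 5.0^2 = 40 (nJ per cycle)
best_energy, best_X = brute_force(EC, SC, V, F, T=25.0, step=100)
print(best_energy, best_X)
```

With a deadline of 25 s, the search assigns all cycles of both tasks to 4.0 V.
With a deadline of 30 s, it uses only 4.0 V and 2.5 V, the two neighbors of the
ideal voltage, which is consistent with the statements listed earlier.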

There are many cases in which tasks actually run faster than predicted by their 
worst case execution times. This cannot be exploited by the above algorithm. 
This limitation can be removed by using checkpoints at which actual and worst
case execution times are compared, and by then using this information to
potentially scale down the voltage [Azevedo et al., 2002]. Also, voltage scaling
in multi-rate task graphs was recently proposed [Schmitz et al., 2002].



5.5.2 Dynamic power management (DPM) 

In order to reduce the energy consumption, we can also take advantage of
power saving states, as introduced on page 101. The essential question for
exploiting DPM is: when should we go to a power-saving state? Straightforward
approaches just use a simple timer to transition into a power-saving state. More
sophisticated approaches model the idle times by stochastic processes and use
these to predict the use of subsystems with more accuracy. Models based on
exponential distributions have been shown to be inaccurate. Sufficiently
accurate models include those based on renewal theory [Simunic et al., 2000].
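
The simple timer-based approach can be evaluated with a few lines of code. The
sketch below is an illustration under simplifying assumptions: there is a
single power-saving state, transition latency is ignored, and the power and
energy figures passed to the functions are hypothetical parameters rather than
data from the cited papers.

```python
def idle_energy(idle_periods, timeout, p_idle, p_sleep, e_transition):
    """Energy spent during idle periods under a fixed-timeout policy.

    The subsystem stays idle for `timeout` seconds and then enters the
    power-saving state. `e_transition` is the energy needed for entering
    and leaving that state.
    """
    total = 0.0
    for t in idle_periods:
        if t <= timeout:
            total += p_idle * t
        else:
            total += p_idle * timeout + e_transition + p_sleep * (t - timeout)
    return total

def break_even_time(p_idle, p_sleep, e_transition):
    """Idle length above which entering the power-saving state pays off."""
    return e_transition / (p_idle - p_sleep)

idle = [1.0, 10.0]                              # assumed idle periods in seconds
always_on = idle_energy(idle, float("inf"), 1.0, 0.1, 0.5)
with_timer = idle_energy(idle, 2.0, 1.0, 0.1, 0.5)
print(always_on, with_timer, break_even_time(1.0, 0.1, 0.5))
```

A timer wastes energy while it waits, and it also pays the transition cost for
idle periods that turn out to be only slightly longer than the timeout. This
is why the prediction of idle times is the core of the more sophisticated
approaches.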

A comprehensive discussion of power management was published by Benini
et al. [Benini and Micheli, 1998]. There are also advanced algorithms which
integrate DVS and DPM into a single optimization approach for saving energy
[Simunic et al., 2001].

Allocating voltages and computing transition times for DPM may be two of 
the last steps of optimizing embedded software. 

5.6 Actual design flows and tools 
5.6.1 SpecC methodology 

Chapter 2 includes a brief description of the SpecC language (see page 76). 
Fig. 5.25 shows a design flow adopted for the SpecC-based SoC methodology
[Gajski et al., 2000], [Gerstlauer et al., 2001].

This methodology starts with specification capture in SpecC. The SpecC spec- 
ification model is executable. Accordingly, simulations can be used to validate 
and analyze the model as well as to estimate certain key design parameters. The 
next step is architecture exploration. This step comprises allocation, partition- 
ing and scheduling. Allocation consists of selecting components (processing 
elements (processors, intellectual property components, or custom hardware), 
memories, busses) from a library. The next step is partitioning. Partitioning 
denotes the mapping of parts of the system specification onto the components. 
Variables are mapped to memories, channels to busses, and behaviors to pro- 
cessing elements. Scheduling is used to serialize the execution. Fig. 5.25 de- 
scribes the flow of information. The actual design exploration will consist of a 
number of steps that are consistent with this flow. Architecture exploration is 
followed by design validation (in fact, validation and estimation will typically 
be intermixed). 

In communication synthesis, abstract busses will be replaced by actual wires in 
a series of refinements. In the backend, software compilers are used to generate 
binary machine code and hardware synthesis tools are used to generate custom 
hardware. 

A design flow similar to the one shown is supported by the SoC Environment
(SCE) that is available from the University of California at Irvine. Further
information can be found in the SCE documentation [Center for Embedded
Computer Systems, 2003].

Figure 5.25. Codesign methodology possible with SpecC



5.6.2 IMEC tool flow 

The design flow proposed by the “Interuniversitair Micro-Electronica
Centrum” (IMEC), Leuven (Belgium) is shown in fig. 5.26. According to this
design flow, specifications can be represented in UML, Java, and Concurrent
C++.

■ The first set of tools, developed in the context of the Matisse/Dynamic 
Memory Management (DMM) project, considers the system at the con- 
current process level as a set of concurrent and dynamic processes, whose 
specification consists of four types of elements: algorithms, abstract data 
types, communication primitives, and real-time requirements. Tools at this 
level are able to perform source code transformations on the dynamic data 
types (and their access functions) and also provide a memory pool
organization in the virtual memory space.

Figure 5.26. Global view of IMEC design flow

■ The second set of tools, developed in the context of the Matador/Task Con- 
currency Management (TCM) project, is again considering a system of con- 
current processes. For these tools, the emphasis is on mapping tasks to 
processors. Different configurations of multi-processor systems are eval- 
uated and curves of designs that are non-inferior to others are generated. 
These curves provide a view of the design space, and are the basis for final 
design decisions. Wong et al. [Wong et al., 2001] describe configurations
for a personal MPEG-4 player. The authors assume that a combination of 
StrongArm processors and custom accelerators is to be used and they found 
4 configurations that satisfy the timing constraint of 30 ms (see table 5.4). 



Processor combination                1     2     3     4
Number of high speed processors      6     5     4     3
Number of low speed processors       0     3     5     7
Total number of processors           6     8     9    10

Table 5.4. Processor configurations



For combinations 1 and 4, the authors report that only one allocation of 
tasks to processors meets the timing constraints. For combinations 2 and 
3, different time budgets lead to different task to processor mappings and 
different energy consumptions. 

Design space exploration is based on the concept of Pareto curves, as
shown in fig. 5.27 for configurations 2 and 3. Each line indicates a
separation of the design space into two subspaces. For example, the area above
the dashed line (configuration 3) corresponds to design points that are
inferior to the design points found for that configuration. For any design in
that area, one could improve either the performance, the energy consumption or
both by using the design points found for configuration 3. Hence, whenever
task to processor mapping leads to a design point in that area, it is ignored*.

Figure 5.27. Pareto curves for processor combinations 2 and 3

For configuration 3, design point 6 is the fastest design that can be gener- 
ated. If the deadline is set to less than about 25.5 ms, configuration 2 has to 
be used. The overall Pareto curve is obtained as the best of the Pareto curves 
for configurations 2 and 3. The concept of Pareto curves is frequently used 
for design space exploration, not just for the IMEC design flow. 

TCM tools also address the storage and transfer of data between dynamically
created tasks (they include a “task-level” version of the Data Transfer
and Storage Exploration (DTSE) tools described next).

■ The next design transformations are the subject of research in the Data
Transfer and Storage Exploration (DTSE) project. A number of phases is
proposed [Miranda et al., 2004], [IMEC, 2003] aiming at a reduction of
the data transfers between processing components and at a reduction of the
storage requirements.

■ DTSE optimizations generate quite complex addressing, including modulo
operations. Addressing is subsequently simplified in address optimization
(ADOPT) tools [Miranda et al., 1998], [Ghez et al., 2000].

■ The resulting code can be used as input to compilers or as input to the final
set of IMEC tools, designed in the OCAPI-XL project. These tools support
the mapping of applications to reconfigurable hardware (see page 115).

* IMEC has proposed to use Pareto curve information also for scheduling at run-time.
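
The notion of non-inferior design points used for the Pareto curves of the TCM
tools can be sketched in a few lines. The code below is a generic illustration
and not part of the IMEC tools: the design points are made-up (execution time,
energy) pairs, smaller values are assumed to be better in both dimensions, and
the function name is hypothetical.

```python
def pareto_front(points):
    """Return the points that are not inferior to any other point.

    Each point is a (time, energy) pair. A point is inferior if another
    point is at least as good in both dimensions.
    """
    front = []
    for p in sorted(set(points)):            # by time, then by energy
        if not front or p[1] < front[-1][1]:
            front.append(p)                  # strictly less energy than all faster points
    return front

# Made-up design points for two processor configurations.
config_2 = [(20, 10), (22, 8), (25, 9)]
config_3 = [(26, 5), (26, 6), (28, 7)]

print(pareto_front(config_3))
print(pareto_front(config_2 + config_3))     # overall curve: best of both
```

Applying the function to the union of the points of several configurations
yields the overall Pareto curve, in the same way as the overall curve in the
text is obtained as the best of the curves for configurations 2 and 3.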

5.6.3 The COSYMA design flow 

COSYMA (cosynthesis for embedded micro-architectures) [Osterling et al.,
1997] is a set of tools for the design of embedded systems. The COSYMA
design flow is shown in fig. 5.28.




Figure 5.28. COSYMA design flow 

COSYMA starts with a specification comprising a set of programs written in
a slightly extended version of C, called C^x. The syntax for each C^x-program
is essentially that of C, extended by a process header. Interprocess
communication is based on predefined C-functions which are later mapped to
physical channels. In addition, the specification includes constraints and
user directives contained in a separate file. Finally, tool-specific
directives for particular tools can be provided as input.

Programs are analyzed by a compiler front-end built from the SUIF set of tools 
[The SUIF group, 2003]. This front-end maintains all the information included 
in the program code and makes it available in an internal data structure. The 
next step is profiling. Profiling identifies the hot spots of the application pro- 
grams and provides the information needed for the optimizations that follow. 
Early versions of COSYMA used simulation-based profiling. Later versions 
also include profiling based on analytical models. 

The next step is scheduling. This step is skipped (and therefore shown with 
dashed lines) if the input contains only a single process. For multiple pro- 
cesses there are two approaches. With the first approach, scheduling generates 
a single process from the original set of processes. With the second, newer 
approach, scheduling is integrated into hardware/software partitioning. The 
granularity used for partitioning is that of basic blocks (maximal sequences 
of code containing no branches, except possibly at the end). Since partition- 
ing has to take communication costs into account, a detailed analysis of the 
information flow in and out of basic blocks is required. 

Hardware at the block level (arithmetic units, multiplexers, etc.) is generated 
by the Braunschweig high-level synthesis system (BSS). The output of this 
system is fed into the commercial Design Compiler from Synopsys, generat- 
ing gate-level descriptions. These descriptions are represented in the form of 
VHDL structural descriptions. 

Binary object code is generated using a standard compiler for the target proces- 
sor. The resulting embedded system consists of both hardware and software. 
Initial versions of COSYMA supported only mono-processor systems. More 
recent versions also support multi-processor systems. A final run-time analysis 
(taking communication delays into account) verifies the timing constraints. 

The design flow is similar to that of COOL. However, COSYMA is a more 
comprehensive system resulting from the effort of a much larger group. 

5.6.4 Ptolemy II 

The Ptolemy project [Davis et al., 2001] focuses on modeling, simulation, and
design of heterogeneous systems. Emphasis is on embedded systems that mix
technologies, for example analog and digital electronics, hardware and
software, and electrical and mechanical devices. Ptolemy supports different
types of applications, including signal processing, control applications,
sequential decision making, and user interfaces. Special attention is paid to
the generation of embedded software. The idea is to generate this software
from the model of computation which is most appropriate for a certain
application. Version 2 of Ptolemy (Ptolemy II) supports the following models
of computation and corresponding domains (see also page 17):

1 Communicating sequential processes (CSP). 

2 Continuous time (CT): This model is appropriate for mechanical systems 
and analog circuits. It is supported through a set of extensible differential 
equation solvers. 

3 Discrete event model (DE): this is the model used by many simulators, e.g. 
VHDL simulators. 

4 Distributed discrete events (DDE). Discrete event systems are difficult to 
simulate in parallel, due to the inherent centralized queue of future events. 
Attempts to distribute this data structure have not been very successful so 
far. Therefore, this special (experimental) domain is introduced. Semantics 
can be defined such that distributed simulation becomes more efficient than
in the DE model.

5 Finite state machines (FSM).

6 Process networks (PN), using Kahn process networks (see page 53).

7 Synchronous dataflow (SDF).

8 Synchronous/reactive (SR) model of computation. This model uses discrete
time, but signals do not need to have a value at every clock tick. Esterel (see
page 79) is a language following this style of modeling.

This list clearly shows the focus on different models of computation in the
Ptolemy project.
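
The centralized queue of future events mentioned for the discrete event model
can be illustrated with a minimal simulation kernel. The sketch below is a
generic illustration in Python and does not use the Ptolemy II API; the class
and function names are hypothetical.

```python
import heapq

class Simulator:
    """Minimal discrete event kernel with one central queue of future events."""

    def __init__(self):
        self.now = 0.0
        self._queue = []      # heap of (time, sequence number, action)
        self._count = 0       # sequence number keeps order for equal times

    def schedule(self, delay, action):
        """Insert an event that calls action(simulator) after `delay`."""
        heapq.heappush(self._queue, (self.now + delay, self._count, action))
        self._count += 1

    def run(self, until):
        """Process events in time order up to and including time `until`."""
        while self._queue and self._queue[0][0] <= until:
            self.now, _, action = heapq.heappop(self._queue)
            action(self)

log = []

def tick(sim):
    """A clock process: record the current time and reschedule itself."""
    log.append(sim.now)
    sim.schedule(5.0, tick)

sim = Simulator()
sim.schedule(0.0, tick)
sim.run(until=12.0)
print(log)
```

Every component inserts its events into the same queue, and the kernel always
removes the event with the smallest time stamp. This single shared data
structure is what makes parallel execution of such simulations difficult.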

5.6.5 The OCTOPUS design flow 

The OCTOPUS design flow [Awad et al., 1996] is completely dedicated towards
the design of embedded software, assuming appropriate wrappers for
hardware. It was observed that there is a poor match between the focus of
object-oriented design techniques on the software object structure and the need
to allocate operations to tasks. This poor match was the main concern that was
addressed in the design of OCTOPUS. OCTOPUS is used within Nokia. Its
design flow includes the following phases:

1 In the systems requirement phase, the behavior of the system is described
by use case diagrams (see page 48) and use cases. The structure of the
environment is described by a so-called context diagram.

2 In the system architecture phase, the structure of the system is broken
down into subsystems. Major interfaces between the subsystems are
identified, but the behavior of the subsystems is not.

3 The subsystem analysis phase is done for every subsystem. In this phase,
class diagrams for the subsystems are generated. The behavior of the
subsystems can be defined in various ways, including StateCharts, so-called
event lists and event sheets.

4 The result of the next phase, the subsystem design phase, includes outlines
for processes/threads, classes and interprocess messages.

5 In the final phase, called subsystem implementation phase, actual code of
the selected programming language is generated.

Obviously, this flow is very much influenced by software technology, with an
adaptation towards distributed systems.




Chapter 6 



VALIDATION 



6.1 Introduction 

One very important aspect of embedded system design has not been considered 
so far: validation. Validation is the process of checking whether or not a certain 
(possibly partial) design is appropriate for its purpose, meets all constraints and 
will perform as expected. Validation is important for any design procedure, and 
hardly any system would work as expected, had it not been validated during the 
design process. Validation is extremely important for safety-critical embedded 
systems. 

In theory, we could try to design verified tools which always generate correct 
implementations from the specification. In practice, this verification of tools 
does not work, except in very simple cases. As a consequence, each and
every design has to be verified. In order to minimize the number of times that
we have to verify a design, we could try to verify it at the very end of the
design process. Unfortunately, this approach normally does not work, due to
the large differences between the level of abstraction used for the
specification and that used for the implementation. Therefore, validation is
required at various phases during the design procedure (see fig. 6.1).
Validation and design should be intertwined and not be considered as two
completely independent activities*.

It would be nice to have a single validation technique applicable to all
validation problems. In practice, none of the available techniques solves all
the problems, and a mix of techniques has to be applied. In this chapter, we
will provide a brief overview of the key techniques which are available.

Figure 6.1. Simplified design information flow

* The same is true for design evaluation, as depicted in fig. 6.1. However, design evaluation is not covered
in this book.

6.2 Simulation 

Simulations are a very common technique for validating designs. Simulations 
consist of executing a design model on appropriate computing hardware, typ- 
ically on general purpose digital computers. Obviously, this requires models 
to be executable. All the executable languages introduced in chapter 2 can be 
used in simulations, and they can be used at various levels as described starting 
at page 79. 

The level at which designs are simulated is always a compromise between 
simulation speed and accuracy. The faster the simulation, the less accuracy is 
available. 

So far, we have used the term behavior in the sense of the functional behavior 
of systems (its input/output behavior). There are also simulations of some 
non-functional behaviors of designs, including the thermal behavior and the 
electro-magnetic compatibility (EMC) with other electronic equipment. 

For embedded systems, simulations have serious limitations: 

■ Simulations are typically a lot slower than the actual design. Hence, if we 
tried to interface the simulator with the actual environment, we would have 
quite a number of violations of timing constraints. 

■ Simulations in the real environment may even be dangerous (who would
want to drive a car with unstable control software?).

■ For many applications, there may be huge amounts of data and it may be
impossible to simulate enough data in the available time. Multimedia
applications are notorious for this. For example, simulating the
compression of some video stream takes an enormous amount of time.

■ Most actual systems are too complex to allow simulating all possible cases 
(inputs). Hence, simulations can help us to find errors in our designs. They 
cannot guarantee absence of errors, since simulations cannot exhaustively 
be done for all possible combinations of inputs and internal states. 

Due to these limitations, there is an increased emphasis on formal verification
(see page 209).

6.3 Rapid Prototyping and Emulation 

There are many cases in which the designs should be tried out in realistic
environments before final versions are manufactured. Control systems in cars
are an excellent example for this. Such systems should be used by drivers
in different environments before mass production is started. Accordingly, the
car industry designs prototypes. These prototypes should essentially behave
like the final systems, but they may be larger, more power consuming and have
other properties which test drivers can accept. Such prototypes can be built,
for example, using FPGAs. Racks containing FPGAs can be stored in the trunk
while test drivers exercise the car.

This approach is not limited to the car industry. There are several other cases
in which prototypes are built from FPGAs. Commercially available emulators
consist of a large number of FPGAs. They come with the required mapping
tools which map specifications to these emulators. Using these emulators,
experiments with systems which behave “almost” like the final systems can be
run.

6.4 Test 
6.4.1 Scope 

In testing, we apply a set of specially selected input patterns, so-called
test patterns, to the input of the system, observe its behavior and compare this
behavior with the expected behavior. Test patterns are normally applied to the 
real, already manufactured system. The main purpose of testing is to identify 
systems that have not been correctly manufactured (manufacturing test) and to 
identify systems that fail later (field test). 

Testing includes a number of different actions:

1 test pattern generation,

2 test pattern application,

3 response observation, and

4 result comparison.



In test pattern generation, we try to identify a set of test patterns which distin- 
guish correctly working from incorrectly working systems. Test pattern gen- 
eration is based on fault models. Such fault models are models of possible 
faults. For example, it is possible to use the stuck-at-fault model, which is
based on the assumption that a fault causes some internal wire of an electronic
circuit to be permanently connected to either '0' or '1'. It
has been observed that many faults actually behave as if some wire was perma- 
nently connected that way. However, recent CMOS technologies require more 
comprehensive fault models. These include transient faults and delay faults 
(faults changing the delay of a circuit) explicitly. While good fault models ex- 
ist for hardware testing, the same is not true for software testing. Test pattern 
generation tries to generate tests for all faults that are possible according to a 
certain fault model. The quality of the test pattern set can be evaluated using 
the fault coverage. Fault coverage is the percentage of potential faults that can 
be found for a given test pattern set: 



$$\text{Coverage} = \frac{\text{Number of detectable faults for a given test pattern set}}{\text{Number of faults possible due to the fault model}}$$



In practice, achieving a good product quality requires fault coverages in the 
area of 98 to 99 %. 
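The coverage definition above can be illustrated with a minimal sketch. The circuit (y = (a AND b) OR c with internal wire w), the wire names and the helper functions are hypothetical and chosen only for illustration; the fault model assumed is the single stuck-at model.

```python
# Hypothetical example circuit: w = a AND b, y = w OR c.
WIRES = ["a", "b", "c", "w", "y"]


def simulate(pattern, fault=None):
    """Evaluate the circuit for one input pattern (a, b, c).

    fault is None (fault-free circuit) or a pair (wire, stuck_value).
    """
    def val(name, v):
        # A stuck-at fault overrides whatever value the wire would carry.
        if fault is not None and fault[0] == name:
            return fault[1]
        return v

    a = val("a", pattern[0])
    b = val("b", pattern[1])
    c = val("c", pattern[2])
    w = val("w", a & b)
    return val("y", w | c)


def fault_coverage(patterns):
    """Fraction of single stuck-at faults detected by the pattern set."""
    faults = [(w, v) for w in WIRES for v in (0, 1)]
    detected = [f for f in faults
                if any(simulate(p, f) != simulate(p) for p in patterns)]
    return len(detected) / len(faults)


print(fault_coverage([(1, 1, 0)]))
print(fault_coverage([(1, 1, 0), (0, 1, 0), (1, 0, 0), (0, 0, 1)]))
```

A fault counts as detected if at least one pattern produces an output that differs from the fault-free output. In this small example a single pattern detects only some of the ten possible faults, while the four-pattern set detects all of them.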

In order to increase the number of options that exist for system validation, it 
has been proposed to use test methods already during the design phase. For 
example, test pattern sets can be applied to software models of systems in 
order to check if two software models behave in the same way. More time- 
consuming formal methods need to be applied only to those cases in which 
this test-based equivalence check did not fail. 



6.4.2 Design for testability 

If testing comes in only as an afterthought, it may be very difficult to test a 
system. For example, verifying whether or not two finite state machines are 
equivalent may require complex homing sequences [Kohavi, 1987] (sequences 
returning the FSM to some initial state). In order to simplify tests, special 
hardware can be added such that testing becomes easier. The process of
designing for better testability is called design for testability, or DfT. Special
purpose hardware for testing finite state machines is a prominent example of
this. Reaching certain states and observing states resulting from the
application of input patterns is very much simplified with scan design. In scan
design, all flip-flops storing states are connected to form serial shift registers
(see fig. 6.2).




Figure 6.2. Scan path design 

Setting the muxes to scan mode, we can load any state into the three flip-flops
serially. In a second phase, we can apply input patterns to the FSM while the
muxes are set to normal mode. As a result, the FSM will be in a new state.
This new state can be serially shifted out in the third and final phase, using the
serial mode again. The net effect is that we do not need to worry about how to
get into certain states and how to observe whether or not δ has been correctly
implemented while we are generating tests for the FSM. A safe way of testing
FSM transitions is to first shift the “old” state into the shift register chain, then
to apply the input, and finally to shift the resulting new state out of the scan
chain. Effectively, the fact that we are dealing with state-based systems has
an impact only on the two (simple) shift phases, and test pattern generation for
(stateless) Boolean networks can be used for checking for correct outputs. This
means that it is sufficient to use test pattern generation methods for Boolean
functions (stateless networks) instead of caring about homing sequences etc.

Scan design is a technique which works well for single chips. For board-level
integration it is necessary to have some technique for connecting scan chains
of several chips. JTAG is a standard which does exactly this. The standard
defines registers at the boundaries of all chips and a number of test pins and
control commands such that all chips can be connected in scan chains. JTAG
is also known as boundary scan [Parker, 1992].

For chips with a large number of flip-flops, it can take quite some time to set
and read all of them. In order to speed up the process of supplying test patterns,
it has been proposed to also integrate hardware for generating test patterns on
the chip. Typically, pseudo-random patterns generated by registers with
feedback paths are used as test patterns.
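A register with feedback paths can be modeled as a linear-feedback shift register. The following sketch assumes a 3-bit register whose feedback is the XOR of bits 2 and 1; the function name, the tap positions and the seed are illustrative choices, not taken from a specific design.

```python
def lfsr_patterns(seed, taps, width, count):
    """Return `count` successive states of a shift register with XOR feedback."""
    mask = (1 << width) - 1
    state = seed & mask
    patterns = []
    for _ in range(count):
        patterns.append(state)
        # Feedback bit: XOR of the tapped bit positions.
        fb = 0
        for t in taps:
            fb ^= (state >> t) & 1
        # Shift left by one and feed the feedback bit into bit 0.
        state = ((state << 1) | fb) & mask
    return patterns


print(lfsr_patterns(seed=1, taps=(2, 1), width=3, count=8))
```

With these taps the register cycles through all seven non-zero 3-bit patterns before repeating. The all-zero state is never reached from a non-zero seed, and a register seeded with zero would stay at zero.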

In order to also avoid shifting out the response of the circuit under test,
responses are compacted. Compacted responses behave very much like cyclic
redundancy check (CRC) characters in that the probability of generating
correct compacted test responses from an incorrect response can be made very
low (about $2^{-n}$, where n is the number of bits in the compacted response).
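Response compaction can be sketched as a shift register with feedback into which each response word is XORed. The register width, the tap positions and the function names below are assumptions made for illustration only.

```python
from itertools import product


def signature(responses, taps=(2, 1), width=3):
    """Compact a sequence of width-bit responses into one width-bit signature."""
    mask = (1 << width) - 1
    state = 0
    for r in responses:
        fb = 0
        for t in taps:
            fb ^= (state >> t) & 1
        # Shift with feedback, then XOR the parallel response bits into the register.
        state = (((state << 1) | fb) & mask) ^ (r & mask)
    return state


def aliasing_fraction(good, width=3):
    """Fraction of incorrect response sequences that yield the correct signature."""
    expected = signature(good, width=width)
    wrong = [seq for seq in product(range(1 << width), repeat=len(good))
             if list(seq) != list(good)]
    aliased = [seq for seq in wrong if signature(seq, width=width) == expected]
    return len(aliased) / len(wrong)


good = [3, 6, 1, 4]
faulty = [3, 6, 0, 4]
print(signature(good), signature(faulty))
print(aliasing_fraction(good))
```

For this 3-bit register, 511 of the 4095 incorrect sequences of four responses produce the same signature as the correct sequence, a fraction close to 2^-3.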

The built-in logic block observer (BILBO) [Könemann et al., 1979] has been
proposed as a circuit combining test pattern generation, test response com- 
paction and serial input/output capabilities. A BILBO with three D-type flip- 
flops is shown in fig. 6.3. 




Figure 6.3. BILBO 

Modes of BILBO registers are shown in table 6.1. The 3-bit register shown 
in fig. 6.3 can be in scan path, reset, linear-feedback shift register (LFSR) and 
normal mode. In LFSR mode, it can be used either for generating pseudo-random
patterns or for compacting responses from the inputs ($Z_0$ to $Z_2$).

Typically, BILBOs are used in pairs. One BILBO generates pseudo-random 
test patterns, feeding some Boolean network with these patterns. The response 
of the Boolean network is then compressed by a second BILBO connected to 
the output of the network. At the end of the test sequence, the compacted 
response is serially shifted out and compared with the expected response.

Table 6.1. Modes of BILBO registers



DfT hardware is of great help during the prototyping and debugging of
hardware. It is also useful to have DfT hardware in the final product, since
hardware fabrication never has a zero defect rate. Testing fabricated hardware
significantly contributes to the overall cost of a product and mechanisms that
reduce this cost are highly appreciated by all companies.

6.4.3 Self-test programs 

One of the key problems of testing modern integrated circuits is their limited 
number of pins, making it more and more difficult to access internal compo- 
nents. Also, it is getting very difficult to test these circuits at full speed, since 
testers must be at least as fast as the circuits themselves. The fact that many 
embedded systems are based on processors provides a way out of this dilemma: 
processors are capable of running test programs or diagnostics. Such
diagnostics have been used to test mainframe machines for decades. Fig. 6.4
shows some components that might be contained in a processor.



[Figure labels: instruction register, register file, ALU, stuck-at-0 error]

Figure 6.4. Segment from processor hardware

The types of faults that are considered are part of the fault model. In order
to test for stuck-at faults at the input of the ALU, we can execute a small test
program:

store pattern of all '1's in register file;

perform xor between constant "0000...00" and register;

test if result contains '0' bit;

if yes, report error;

otherwise start test for next stuck-at fault

Similar small programs can be generated for other stuck-at errors.
Unfortunately, the process of generating diagnostics for mainframes has mostly
been a manual one. Some researchers have proposed generating diagnostics
automatically [Brahme and Abraham, 1984], [Krüger, 1986], [Bieker and Marwedel,
1995], [Krstic and Dey, 2002].
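The small test program for a stuck-at-0 fault at the ALU input can be mimicked in software. The sketch below is a simulation, not a real diagnostic: the 8-bit word width, the function names and the `stuck_at_0_bit` parameter (a hook for modeling the fault) are assumptions.

```python
WIDTH = 8
ALL_ONES = (1 << WIDTH) - 1


def alu_xor(a, b, stuck_at_0_bit=None):
    """XOR of a and b. Optionally, one bit of input a is stuck at 0."""
    if stuck_at_0_bit is not None:
        a &= ~(1 << stuck_at_0_bit)   # model the faulty wire
    return (a ^ b) & ALL_ONES


def self_test(stuck_at_0_bit=None):
    """Return True if the test passes, False if an error has to be reported."""
    register = ALL_ONES                               # store pattern of all '1's
    result = alu_xor(register, 0, stuck_at_0_bit)     # xor with constant 0000...00
    return result == ALL_ONES                         # any '0' bit indicates an error


print(self_test())                    # fault-free ALU
print(self_test(stuck_at_0_bit=3))    # ALU input bit 3 stuck at 0
```

Because the XOR with an all-zero constant passes the register contents through unchanged, any '0' bit in the result points to a wire that could not be driven to '1'.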

6.5 Fault simulation 

It is currently not feasible (and it will probably not be feasible) to completely 
predict the behavior of systems in the presence of faults. Therefore, the behav- 
ior of systems in the presence of faults is frequently simulated. This type of 
simulation is called fault simulation. In fault simulation, system models are 
modified to reflect the behavior of the system in the presence of a certain fault. 

The goals of fault simulation include: 

■ to determine the effect of component faults at the system level. Faults
are called redundant if they do not affect the observable behavior of the
system, and

■ to know whether or not mechanisms for improving fault tolerance actually 
help. 

Fault simulation requires the simulation of the system for all faults feasible 
for the fault model and also for a possibly large number of different input pat- 
terns. Accordingly, fault simulation is an extremely time-consuming process. 
Different techniques have been proposed to speed up fault simulation. These 
techniques include parallel fault simulation. Parallel fault simulation is espe- 
cially effective if the system is modeled at the gate level. In this case, internal 
signals are single bit signals. This fact enables the mapping of a signal to a 
single bit of some machine word of a simulating host machine. AND- and OR- 
machine instructions can then be used to simulate Boolean networks. However, 
only a single bit would be used per machine word. Efficiency is improved with 
parallel fault simulation. In parallel fault simulation, n different test patterns 
are simulated at the same time, if n is the machine word size. The values of 
each of the n test patterns are mapped to a different bit position in the machine 
word. Executing the same set of AND- and OR-instructions will then simulate 
the behavior of the Boolean network for n test patterns instead of for just one. 
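The mapping of test patterns to bit positions of a machine word can be sketched as follows. The Boolean network y = (a AND b) OR c and the helper names are hypothetical. Python integers are not limited to a fixed machine word size, so the sketch only illustrates the principle.

```python
from itertools import product


def network(a, b, c):
    """Hypothetical Boolean network y = (a AND b) OR c, applied to whole words."""
    return (a & b) | c


def pack(bits):
    """Pack a list of 0/1 values into one integer, pattern i in bit position i."""
    word = 0
    for i, bit in enumerate(bits):
        word |= bit << i
    return word


def simulate_parallel(patterns):
    """Simulate all patterns with one pass of word-wide AND/OR operations."""
    a = pack([p[0] for p in patterns])
    b = pack([p[1] for p in patterns])
    c = pack([p[2] for p in patterns])
    y = network(a, b, c)
    return [(y >> i) & 1 for i in range(len(patterns))]


patterns = list(product((0, 1), repeat=3))
print(simulate_parallel(patterns))
```

The same `network` function is executed once, yet each bit position of the result holds the output for a different test pattern.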
6.6 Fault injection

Fault simulation may be too time-consuming for real systems. If actual systems 
are available, fault injection can be used instead. In fault injection, real existing 
systems are modified and the overall effect on the system behavior is checked. 
Fault injection does not rely on fault models (even though they can be used). 
Hence, fault injection has the potential of generating faults that would not have 
been predicted by a fault model. 

We can distinguish between two types of fault injection: 

■ local faults within the system, and 

■ faults in the environment (behaviors which do not correspond to the
specification). For example, we can check how the system behaves if it is
operated outside the specified temperature or radiation ranges.

Several methods can be used for fault injection:

■ Fault injection at the hardware level: Examples include pin-manipulation,
electromagnetic and nuclear radiation.

■ Fault injection at the software level: Examples include toggling some
memory bits.

According to experiments reported by Kopetz [Kopetz, 1997], software-based
fault injection was essentially as effective as hardware-based fault injection.
Nuclear radiation was a noticeable exception in that it generated errors which
were not generated with other methods.
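Software-level fault injection by toggling memory bits can be sketched as follows. The memory contents, the XOR checksum used as error detection mechanism and all function names are hypothetical and serve only to show how the effect of an injected fault is checked.

```python
def checksum(memory):
    """XOR of all bytes: a deliberately simple, hypothetical error detection mechanism."""
    result = 0
    for byte in memory:
        result ^= byte
    return result


def inject_bit_flip(memory, address, bit):
    """Return a copy of the memory with one bit toggled."""
    faulty = bytearray(memory)
    faulty[address] ^= 1 << bit
    return faulty


def campaign(memory):
    """Inject every single-bit flip and count how many of them are detected."""
    reference = checksum(memory)
    total = detected = 0
    for address in range(len(memory)):
        for bit in range(8):
            total += 1
            if checksum(inject_bit_flip(memory, address, bit)) != reference:
                detected += 1
    return detected, total


memory = bytearray(b"\x12\x34\x56")
print(campaign(memory))

# Two flips at the same bit position of different bytes escape this checksum.
double = inject_bit_flip(inject_bit_flip(memory, 0, 2), 1, 2)
print(checksum(double) == checksum(memory))
```

Such a campaign shows which injected faults a mechanism detects and which ones it misses. Here every single-bit flip is detected, whereas the double flip is not.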

6.7 Risk- and dependability analysis 

Embedded systems (like many products) can cause damage to property and
lives. It is not possible to reduce the risk of damage to zero. The best that
we can do is to make the probability of damage small, hopefully orders of
magnitude smaller than other risks. For many applications, the probability of a
catastrophe has to be less than $10^{-9}$ per hour [Kopetz, 1997], corresponding
to one case per 100,000 systems operating for 10,000 hours. Damages
result from hazards. For each possible damage there is a severity (the
cost) and a probability. Risk can be defined as the product of the two.
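As a quick check, one case per 100,000 systems operating for 10,000 hours each corresponds to a rate of $10^{-9}$ per hour. The definition of risk is restated as a formula for reference:

```latex
\[
10^{-9}\,\mathrm{h}^{-1} \times 10^{5}\ \text{systems} \times 10^{4}\,\mathrm{h}
  = 1\ \text{expected case}
\]
\[
\text{Risk} = \text{severity} \times \text{probability}
\]
```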

Risks can be analyzed with several techniques [Dunn, 2002], [Press, 2003]:

■ Fault tree analysis (FTA): FTA is a top-down method of analyzing risks.
The analysis starts with a possible damage and then tries to come up with
possible scenarios that lead to that damage. FTA typically uses a graphical
representation of possible damages, including symbols for AND- and OR-
gates. OR-gates are used if a single event could result in a hazard. AND-
gates are used when several events or conditions are required for that hazard
to exist. Fig. 6.5 shows an example.




[Figure label: OS hazard]

Figure 6.5. Fault tree

The simple AND- and OR-gates cannot model all situations. For exam- 
ple, their modeling power is exceeded if shared resources of some limited 
amount (like energy or storage locations) exist. Markov models [Brémaud,
1999] may have to be used to cover such cases.

■ Failure mode and effect analysis (FMEA): FMEA starts at the compo- 
nents and tries to estimate their reliability. Using this information, the re- 
liability of the system is computed from the reliability of its parts (corre- 
sponding to a bottom-up analysis). The first step is to create a table con- 
taining components, possible faults, probability of faults and consequences 
on the system behavior. Risks for the system as a whole are then computed 
from the table. Table 6.2 shows an example^. 
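The AND- and OR-gates of the fault tree analysis described above can be evaluated with a few lines of code. The tree, the event names and the probabilities in this sketch are invented for illustration. The probability computation assumes statistically independent basic events, each occurring only once in the tree.

```python
def AND(*children):
    return ("and", children)


def OR(*children):
    return ("or", children)


def occurs(node, events):
    """True if the hazard at `node` exists, given the set of basic events that occurred."""
    if isinstance(node, str):
        return node in events
    kind, children = node
    results = [occurs(child, events) for child in children]
    return all(results) if kind == "and" else any(results)


def probability(node, p):
    """Hazard probability under the independence assumption stated above."""
    if isinstance(node, str):
        return p[node]
    kind, children = node
    probs = [probability(child, p) for child in children]
    if kind == "and":
        result = 1.0
        for q in probs:
            result *= q
        return result
    none = 1.0                      # OR-gate: 1 - P(no child event occurs)
    for q in probs:
        none *= 1.0 - q
    return 1.0 - none


# Hypothetical tree: the hazard exists if power fails, or if both the
# sensor and the watchdog fail.
tree = OR("power failure", AND("sensor fault", "watchdog fault"))
print(occurs(tree, {"sensor fault"}))
print(occurs(tree, {"sensor fault", "watchdog fault"}))
print(probability(tree, {"power failure": 0.5, "sensor fault": 0.5, "watchdog fault": 0.5}))
```

As noted for shared resources, such a purely combinational evaluation does not cover all situations.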

Tools supporting both approaches are available. Both approaches may be used 
in “safety cases”. In such cases, an independent authority has to be convinced 
that certain technical equipment is indeed safe. One of the commonly re- 
quested properties of technical systems is that no single failing component 
should potentially cause a catastrophe. 



^Realistic fault probabilities are unknown at the time of writing.

Component   Failure           Consequences   Probability   Critical?
Processor   metal migration   no service     10^-?/h       yes

Table 6.2. FMEA table



6.8 Formal Verification 

Formal verification is concerned with formally proving a system correct, using 
the language of mathematics. First of all, a formal model is required to make 
formal verification applicable. This step can hardly be automated and may 
require some effort. Once the model is available, we can try to prove certain 
properties. 

Formal verification techniques can be classified by the type of logics employed: 

■ Propositional logic: In this case, models consist of Boolean formulas de- 
scribed with Boolean variables and connectives such as and and or. (State- 
less) gate-level logic networks can be conveniently described with proposi- 
tional logic. Tools typically aim at checking if two models represented this 
way are equivalent. Such tools are called tautology checkers or equiva- 
lence checkers. Since propositional logic is decidable, it is also decidable 
whether or not the two representations are equivalent (there will be no cases 
of doubt). For example, one representation might correspond to gates of an 
actual circuit and the other to its specification. Proving the equivalence then 
proves the effect of all design transformations (for example, optimizations 
for power or delay) to be correct. Tautology checkers can frequently cope 
with designs which are too large to allow simulation-based exhaustive val- 
idation. The key reason for the power of recent tautology checkers is the 
use of Binary Decision Diagrams (BDDs) [Wegener, 2000]. The complex- 
ity of equivalence checks of Boolean functions represented with BDDs is 
linear in the number of BDD-nodes. In contrast, the equivalence check for 
functions represented by sums of products is NP-hard. Still, the number 
of BDD-nodes required to represent a certain function has to be taken into 
account. Many functions can be efficiently represented with BDDs. In
general, however, the number of nodes of BDDs grows exponentially with
the number of variables. In those cases in which functions can be efficiently
represented with BDDs, BDD-based equivalence checkers have frequently
replaced simulators and are used to verify gate networks with millions of
transistors. The ability to also verify finite state machines is very much
limited.

■ First order logic (FOL): FOL includes quantification, using ∃ and ∀.
Some automation for verifying FOL models is feasible. However, since
FOL is undecidable in general, there may be cases of doubt.

■ Higher order logic (HOL): Higher order logic allows functions to be
manipulated like other objects (see
http://archive.comlab.ox.ac.uk/formal-methods/hol.html). For higher order
logic, proofs can hardly ever be automated and
typically must be done manually with some proof-support.
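The decidability of equivalence in propositional logic can be illustrated by a check that simply enumerates all input combinations. This sketch is not BDD-based and its run time grows exponentially with the number of variables. The three example functions are hypothetical.

```python
from itertools import product


def equivalent(f, g, n):
    """Compare f and g on all 2**n inputs.

    Returns (True, None) if they agree everywhere, else (False, counterexample).
    """
    for inputs in product((False, True), repeat=n):
        if f(*inputs) != g(*inputs):
            return False, inputs
    return True, None


def spec(a, b, c):
    """Specification: (a AND b) OR (a AND c)."""
    return (a and b) or (a and c)


def optimized(a, b, c):
    """Result of a correct design transformation: a AND (b OR c)."""
    return a and (b or c)


def buggy(a, b, c):
    """Result of an incorrect transformation: a OR (b AND c)."""
    return a or (b and c)


print(equivalent(spec, optimized, 3))
print(equivalent(spec, buggy, 3))
```

Since all 2^n cases are covered, the answer is always either a confirmation of equivalence or a concrete counterexample. There are no cases of doubt.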

Model checking 

Verification of finite state machines can be performed with model checking.
Model checking aims at the verification of properties of finite state systems. It
analyzes the state space of the system. Verification using this approach requires
three stages:

1 the generation of a model of the system to be verified,

2 the definition of the properties expected from the system, and

3 model checking (the actual verification step).

Accordingly, model checking systems accept the model and properties as input
(see fig. 6.6).




[Figure label: proof or counterexample]

Figure 6.6. Inputs for model checking

Verification tools can prove or disprove the properties. In the latter case, they 
can provide a counter-example. Model checking is easier to automate than 
FOL. Languages used for the definition of properties typically allow the quan- 
tification of states. 

A popular system for model checking is Clarke’s EMC-system [Clarke et al.,
2003]. This system accepts properties described as CTL formulas. CTL stands
for “Computation Tree Logic”. CTL formulas include two parts:

■ a path quantifier (this part specifies paths in the state transition diagram),
and

■ a state quantifier (this part specifies states).

Example: M, s ⊨ AG g means: In the transition graph M, property g holds for
all paths (denoted by A) and all states (denoted by G).
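For a small transition graph, a property of the form AG g can be checked by visiting every state reachable from s. The sketch below uses explicit state enumeration. It is not the algorithm of the EMC-system and it is not BDD-based. The transition graph and the state names are hypothetical.

```python
from collections import deque


def check_AG(transitions, s, g):
    """Check M, s |= AG g by exploring all states reachable from s.

    Returns (True, None) if g holds in every reachable state, else
    (False, path), where path leads from s to a state violating g.
    """
    parent = {s: None}
    queue = deque([s])
    while queue:
        state = queue.popleft()
        if not g(state):
            # Reconstruct the counterexample by following parent links.
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return False, path[::-1]
        for nxt in transitions.get(state, ()):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return True, None


def no_error(state):
    """Property g: the system is not in the error state."""
    return state != "error"


M_bad = {"idle": ["busy"], "busy": ["idle", "error"], "error": ["error"]}
M_ok = {"idle": ["busy"], "busy": ["idle"]}

print(check_AG(M_ok, "idle", no_error))
print(check_AG(M_bad, "idle", no_error))
```

If the property does not hold, the returned path serves as a counterexample.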

In 1987, model checking was implemented using BDDs. It was possible to
locate several errors in the specification of the future bus protocol.

Extensions are needed in order to also cover real-time behavior and numbers.
More information on formal verification can be found in books on this topic
(refer to, for example, the books by Kropf [Kropf, 1999] and Clarke et al.
[Clarke et al., 2000]).